﻿WEBVTT

00:00:08.407 --> 00:00:10.321
- Okay, sounds like it is.

00:00:10.321 --> 00:00:12.450
I'll be telling you about
adversarial examples

00:00:12.450 --> 00:00:15.292
and adversarial training today.

00:00:15.292 --> 00:00:16.125
Thank you.

00:00:18.100 --> 00:00:20.606
As an overview, I will
start off by telling you

00:00:20.606 --> 00:00:22.871
what adversarial examples are,

00:00:22.871 --> 00:00:26.017
and then I'll explain why they happen,

00:00:26.017 --> 00:00:28.670
why it's possible for them to exist.

00:00:28.670 --> 00:00:31.026
I'll talk a little bit about
how adversarial examples

00:00:31.026 --> 00:00:33.580
pose real world security threats,

00:00:33.580 --> 00:00:36.248
that they can actually
be used to compromise

00:00:36.248 --> 00:00:38.514
systems built on machine learning.

00:00:38.514 --> 00:00:41.302
I'll tell you what the
defenses are so far,

00:00:41.302 --> 00:00:43.986
but mostly defenses are
an open research problem

00:00:43.986 --> 00:00:47.586
that I hope some of you
will move on to tackle.

00:00:47.586 --> 00:00:49.075
And then finally I'll tell you

00:00:49.075 --> 00:00:50.644
how to use adversarial examples

00:00:50.644 --> 00:00:53.156
to improve other machine
learning algorithms

00:00:53.156 --> 00:00:56.020
even if you want to build a
machine learning algorithm

00:00:56.020 --> 00:00:59.270
that won't face a real world adversary.

00:01:00.989 --> 00:01:05.272
Looking at the big picture and
the context for this lecture,

00:01:05.272 --> 00:01:07.511
I think most of you are probably here

00:01:07.511 --> 00:01:10.390
because you've heard
how incredibly powerful

00:01:10.390 --> 00:01:12.692
and successful machine learning is,

00:01:12.692 --> 00:01:14.478
that very many different tasks

00:01:14.478 --> 00:01:17.130
that could not be solved
with software before

00:01:17.130 --> 00:01:20.188
are now solvable thanks to deep learning

00:01:20.188 --> 00:01:23.785
and convolutional networks
and gradient descent.

00:01:23.785 --> 00:01:27.138
All of these technologies
are working really well.

00:01:27.138 --> 00:01:28.661
Until just a few years ago,

00:01:28.661 --> 00:01:30.988
these technologies didn't really work.

00:01:30.988 --> 00:01:33.868
In about 2013, we started to see

00:01:33.868 --> 00:01:37.036
that deep learning achieved
human level performance

00:01:37.036 --> 00:01:39.018
at a lot of different tasks.

00:01:39.018 --> 00:01:40.993
We saw that convolutional nets

00:01:40.993 --> 00:01:43.228
could recognize objects in images

00:01:43.228 --> 00:01:47.165
and score about the same as
people in those benchmarks,

00:01:47.165 --> 00:01:49.638
with the caveat that
part of the reason that

00:01:49.638 --> 00:01:51.306
algorithms score as well as people

00:01:51.306 --> 00:01:52.761
is that people can't tell

00:01:52.761 --> 00:01:55.410
Alaskan Huskies from
Siberian Huskies very well,

00:01:55.410 --> 00:01:58.559
but modulo the strangeness
of the benchmarks

00:01:58.559 --> 00:02:01.781
deep learning caught up to
about human level performance

00:02:01.781 --> 00:02:05.243
for object recognition in about 2013.

00:02:05.243 --> 00:02:08.547
That same year, we also
saw that object recognition

00:02:08.547 --> 00:02:12.458
applied to human faces caught
up to about human level.

00:02:12.458 --> 00:02:14.709
Suddenly we had computers

00:02:14.709 --> 00:02:17.874
that could recognize
faces about as well as

00:02:17.874 --> 00:02:21.728
you or I could recognize
faces of strangers.

00:02:21.728 --> 00:02:24.642
You can recognize the faces
of your friends and family

00:02:24.642 --> 00:02:27.537
better than a computer,
but when you're dealing

00:02:27.537 --> 00:02:30.152
with people that you haven't
had a lot of experience with

00:02:30.152 --> 00:02:34.306
the computer caught up
to us in about 2013.

00:02:34.306 --> 00:02:36.108
We also saw that computers caught up

00:02:36.108 --> 00:02:40.275
to humans for reading
typewritten fonts in photos

00:02:41.183 --> 00:02:42.987
in about 2013.

00:02:42.987 --> 00:02:46.401
It even got to the point that we
could no longer use CAPTCHAs

00:02:46.401 --> 00:02:50.634
to tell whether a user of
a webpage is human or not

00:02:50.634 --> 00:02:52.439
because the convolutional network

00:02:52.439 --> 00:02:56.496
is better at reading obfuscated
text than a human is.

00:02:56.496 --> 00:02:58.406
So with this context today

00:02:58.406 --> 00:03:00.095
of deep learning working really well

00:03:00.095 --> 00:03:02.019
especially for computer vision

00:03:02.019 --> 00:03:05.136
it's a little bit unusual to think about

00:03:05.136 --> 00:03:07.800
the computer making a mistake.

00:03:07.800 --> 00:03:10.409
Before about 2013,
nobody was ever surprised

00:03:10.409 --> 00:03:12.250
if the computer made a mistake.

00:03:12.250 --> 00:03:14.659
That was the rule not the exception,

00:03:14.659 --> 00:03:16.767
and so today's topic which is all about

00:03:16.767 --> 00:03:20.132
unusual mistakes that deep
learning algorithms make

00:03:20.132 --> 00:03:24.000
this topic wasn't really
a serious avenue of study

00:03:24.000 --> 00:03:28.099
until the algorithms started
to work well most of the time,

00:03:28.099 --> 00:03:31.555
and now people study
the way that they break

00:03:31.555 --> 00:03:36.412
now that that's actually the
exception rather than the rule.

00:03:36.412 --> 00:03:39.168
An adversarial example is an example

00:03:39.168 --> 00:03:43.382
that has been carefully
computed to be misclassified.

00:03:43.382 --> 00:03:45.864
In a lot of cases we're
able to make the new image

00:03:45.864 --> 00:03:48.331
indistinguishable to a human observer

00:03:48.331 --> 00:03:50.226
from the original image.

00:03:50.226 --> 00:03:52.833
Here, I show you one where
we start with a panda.

00:03:52.833 --> 00:03:54.528
On the left this is a panda

00:03:54.528 --> 00:03:57.297
that has not been modified in any way,

00:03:57.297 --> 00:03:59.855
and the convolutional
network trained on the images

00:03:59.855 --> 00:04:03.849
in that dataset is able to
recognize it as being a panda.

00:04:03.849 --> 00:04:05.524
One interesting thing is that the model

00:04:05.524 --> 00:04:08.064
doesn't have a whole lot of
confidence in that decision.

00:04:08.064 --> 00:04:10.656
It assigns about 60% probability

00:04:10.656 --> 00:04:13.411
to this image being a panda.

00:04:13.411 --> 00:04:16.055
If we then compute exactly the way

00:04:16.055 --> 00:04:17.947
that we could modify the image

00:04:17.947 --> 00:04:20.624
to cause the convolutional
network to make a mistake

00:04:20.624 --> 00:04:23.006
we find that the optimal direction

00:04:23.006 --> 00:04:27.017
to move all the pixels is given
by this image in the middle.

00:04:27.017 --> 00:04:29.625
To a human it looks a lot like noise.

00:04:29.625 --> 00:04:31.176
It's not actually noise.

00:04:31.176 --> 00:04:33.244
It's carefully computed as a function

00:04:33.244 --> 00:04:34.883
of the parameters of the network.

00:04:34.883 --> 00:04:36.880
There's actually a lot of structure there.

00:04:36.880 --> 00:04:41.053
If we multiply that image
of the structured attack

00:04:41.053 --> 00:04:45.373
by a very small coefficient and
add it to the original panda

00:04:45.373 --> 00:04:48.131
we get an image that a human can't tell

00:04:48.131 --> 00:04:49.806
from the original panda.

00:04:49.806 --> 00:04:52.811
In fact, on this slide
there is no difference

00:04:52.811 --> 00:04:54.447
between the panda on the left

00:04:54.447 --> 00:04:56.286
and the panda on the right.

00:04:56.286 --> 00:04:58.753
When we present the image
to the convolutional network

00:04:58.753 --> 00:05:01.921
we use 32-bit floating point values.

00:05:01.921 --> 00:05:05.142
The monitor here can
only display eight bits

00:05:05.142 --> 00:05:07.712
of color resolution, and
we have made a change

00:05:07.712 --> 00:05:09.279
that's just barely too small

00:05:09.279 --> 00:05:12.613
to affect the smallest
of those eight bits,

00:05:12.613 --> 00:05:14.411
but it affects the other 24

00:05:14.411 --> 00:05:17.345
of the 32-bit floating
point representation,

00:05:17.345 --> 00:05:19.342
and that little tiny change is enough

00:05:19.342 --> 00:05:21.198
to fool the convolutional network

00:05:21.198 --> 00:05:25.365
into recognizing this image
of a panda as being a gibbon.

00:05:26.420 --> 00:05:28.056
Another interesting thing is that

00:05:28.056 --> 00:05:29.857
it doesn't just change the class.

00:05:29.857 --> 00:05:32.881
It's not that we just barely
found the decision boundary

00:05:32.881 --> 00:05:34.734
and just barely stepped across it.

00:05:34.734 --> 00:05:37.702
The convolutional network
actually has much more confidence

00:05:37.702 --> 00:05:40.172
in its incorrect prediction,

00:05:40.172 --> 00:05:42.097
that the image on the right is a gibbon,

00:05:42.097 --> 00:05:45.891
than it had for the
original being a panda.

00:05:45.891 --> 00:05:47.588
On the right, it believes that the image

00:05:47.588 --> 00:05:50.752
is a gibbon with 99.9% probability,

00:05:50.752 --> 00:05:53.848
so before it thought that there was about

00:05:53.848 --> 00:05:57.341
1/3 chance that it was
something other than a panda,

00:05:57.341 --> 00:06:00.238
and now it's about as
certain as it can possibly be

00:06:00.238 --> 00:06:02.417
that it's a gibbon.
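The attack just described is the fast gradient sign method: take the sign of the input gradient, scale it by a small coefficient, and add it to the image. Here is a minimal sketch in NumPy on a toy logistic model; the weights, input, and epsilon are illustrative stand-ins, not the actual panda network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, grad_wrt_x, epsilon):
    # Move every input dimension a small step epsilon in the
    # direction (sign of the gradient) that increases the loss.
    return x + epsilon * np.sign(grad_wrt_x)

rng = np.random.default_rng(0)
w = rng.normal(size=64)        # toy "network": p(panda|x) = sigmoid(w.x + b)
b = 0.1
x = rng.normal(size=64)        # the clean image, flattened

# Gradient of the negative log-likelihood of the true class (y = 1)
# with respect to the input is (p - 1) * w.
p = sigmoid(w @ x + b)
grad = (p - 1.0) * w

x_adv = fgsm(x, grad, epsilon=0.25)
p_adv = sigmoid(w @ x_adv + b)  # confidence in the true class collapses
```

Even though each input dimension moves by at most epsilon, the logit shifts by epsilon times the sum of the absolute weights, which grows with the input dimension.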

00:06:02.417 --> 00:06:05.585
As a little bit of history,
people have studied ways

00:06:05.585 --> 00:06:07.942
of computing attacks to fool

00:06:07.942 --> 00:06:09.656
different machine learning models

00:06:09.656 --> 00:06:13.596
since at least about
2004, and maybe earlier.

00:06:13.596 --> 00:06:15.305
For a long time this
was done in the context

00:06:15.305 --> 00:06:17.772
of fooling spam detectors.

00:06:17.772 --> 00:06:21.406
In about 2013, Battista Biggio found

00:06:21.406 --> 00:06:24.161
that you could fool neural
networks in this way,

00:06:24.161 --> 00:06:27.080
and around the same time my
colleague, Christian Szegedy,

00:06:27.080 --> 00:06:29.311
found that you could
make this kind of attack

00:06:29.311 --> 00:06:30.948
against deep neural networks

00:06:30.948 --> 00:06:33.147
just by using an optimization algorithm

00:06:33.147 --> 00:06:36.368
to search over the input image.

00:06:36.368 --> 00:06:37.952
A lot of what I'll be
telling you about today

00:06:37.952 --> 00:06:40.111
is my own follow-up work on this topic,

00:06:40.111 --> 00:06:43.496
but I've spent a lot of my
career over the past few years

00:06:43.496 --> 00:06:46.539
understanding why these
attacks are possible

00:06:46.539 --> 00:06:50.706
and why it's so easy to fool
these convolutional networks.

00:06:52.279 --> 00:06:54.129
When my colleague, Christian,

00:06:54.129 --> 00:06:57.104
first discovered this phenomenon

00:06:57.104 --> 00:07:01.237
independently from Battista
Biggio but around the same time,

00:07:01.237 --> 00:07:04.404
he found that it was actually a result

00:07:05.652 --> 00:07:08.206
of a visualization he was trying to make.

00:07:08.206 --> 00:07:10.260
He wasn't studying security.

00:07:10.260 --> 00:07:12.401
He wasn't studying how
to fool a neural network.

00:07:12.401 --> 00:07:14.686
Instead, he had a convolutional network

00:07:14.686 --> 00:07:16.736
that could recognize objects very well,

00:07:16.736 --> 00:07:19.134
and he wanted to understand how it worked,

00:07:19.134 --> 00:07:23.301
so he thought that maybe he
could take an image of a scene,

00:07:24.156 --> 00:07:26.350
for example a picture of a ship,

00:07:26.350 --> 00:07:28.784
and he could gradually
transform that image

00:07:28.784 --> 00:07:31.428
into something that the
network would recognize

00:07:31.428 --> 00:07:33.622
as being an airplane.

00:07:33.622 --> 00:07:35.513
Over the course of that transformation,

00:07:35.513 --> 00:07:38.844
he could see how the
features of the input change.

00:07:38.844 --> 00:07:40.860
You might expect that maybe the background

00:07:40.860 --> 00:07:44.192
would turn blue to look like
the sky behind an airplane,

00:07:44.192 --> 00:07:46.424
or you might expect that the ship

00:07:46.424 --> 00:07:48.883
would grow wings to look
more like an airplane.

00:07:48.883 --> 00:07:51.209
You could conclude from
that that the convolutional network

00:07:51.209 --> 00:07:56.124
uses the blue sky or uses the
wings to recognize airplanes.

00:07:56.124 --> 00:07:59.019
That's actually not really
what happened at all.

00:07:59.019 --> 00:08:01.212
Each of these panels
here shows an animation

00:08:01.212 --> 00:08:03.737
that you read left to
right, top to bottom.

00:08:03.737 --> 00:08:06.848
Each panel is another
step of gradient ascent

00:08:06.848 --> 00:08:11.441
on the log probability that
the input is an airplane

00:08:11.441 --> 00:08:14.067
according to a convolutional net model,

00:08:14.067 --> 00:08:18.833
and then we follow the gradient
on the input to the image.

00:08:18.833 --> 00:08:20.585
You're probably used to
following the gradient

00:08:20.585 --> 00:08:22.222
on the parameters of a model.

00:08:22.222 --> 00:08:23.840
You can use the back propagation algorithm

00:08:23.840 --> 00:08:26.182
to compute the gradient on the input image

00:08:26.182 --> 00:08:28.001
using exactly the same procedure

00:08:28.001 --> 00:08:29.816
that you would use to compute the gradient

00:08:29.816 --> 00:08:31.976
on the parameters.
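Computing the gradient on the input with backprop can be sketched with a tiny hand-written ReLU net; the weights and shapes below are made up for illustration, and a real experiment would use autodiff in a deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 4)) * 0.5   # first-layer weights
W2 = rng.normal(size=(1, 16)) * 0.5   # second-layer weights
x = rng.normal(size=4)                # the input image (flattened)

def forward(x):
    h = np.maximum(0.0, W1 @ x)       # ReLU hidden layer
    return (W2 @ h)[0]                # scalar "airplane" logit

# Backprop with the SAME chain rule used for parameter gradients,
# except we keep differentiating down to x instead of stopping at W1.
h = np.maximum(0.0, W1 @ x)
dh = W2[0] * (h > 0)                  # gradient through the ReLU mask
dx = W1.T @ dh                        # gradient of the logit w.r.t. x

# One gradient-ascent step on the INPUT raises the airplane logit.
x_step = x + 1e-3 * dx
```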

00:08:31.976 --> 00:08:34.803
In this animation of the
ship in the upper left,

00:08:34.803 --> 00:08:37.918
we see five panels that all
look basically the same.

00:08:37.918 --> 00:08:39.339
Gradient ascent doesn't seem

00:08:39.339 --> 00:08:40.793
to have moved the image at all,

00:08:40.793 --> 00:08:43.496
but by the last panel the
network is completely confident

00:08:43.496 --> 00:08:45.287
that this is an airplane.

00:08:45.287 --> 00:08:47.580
When you first code up
this kind of experiment,

00:08:47.580 --> 00:08:49.433
especially if you don't
know what's going to happen,

00:08:49.433 --> 00:08:51.881
it feels a little bit like
you have a bug in your script

00:08:51.881 --> 00:08:52.937
and you're just displaying

00:08:52.937 --> 00:08:54.761
the same image over and over again.

00:08:54.761 --> 00:08:55.952
The first time I did it,

00:08:55.952 --> 00:08:58.419
I couldn't believe it was happening,

00:08:58.419 --> 00:09:00.540
and I had to open up the images in NumPy,

00:09:00.540 --> 00:09:02.355
and take the difference of them,

00:09:02.355 --> 00:09:03.813
and make sure that there was actually

00:09:03.813 --> 00:09:07.359
a non-zero difference
in there, but there is.

00:09:07.359 --> 00:09:09.250
I show several different animations here

00:09:09.250 --> 00:09:12.333
of a ship, a car, a cat, and a truck.

00:09:13.172 --> 00:09:15.817
The only one where I actually
see any change at all

00:09:15.817 --> 00:09:18.250
is the image of the cat.

00:09:18.250 --> 00:09:21.038
The color of the cat's
face changes a little bit,

00:09:21.038 --> 00:09:23.646
and maybe it becomes a little bit more

00:09:23.646 --> 00:09:25.969
like the color of a metal airplane.

00:09:25.969 --> 00:09:28.470
Other than that, I don't see any changes

00:09:28.470 --> 00:09:29.895
in any of these animations,

00:09:29.895 --> 00:09:33.908
and I don't see anything very
suggestive of an airplane.

00:09:33.908 --> 00:09:36.985
So gradient ascent, rather
than turning the input

00:09:36.985 --> 00:09:39.240
into an example of an airplane,

00:09:39.240 --> 00:09:42.818
has found an image that fools the network

00:09:42.818 --> 00:09:45.519
into thinking that the
input is an airplane.

00:09:45.519 --> 00:09:47.050
And if we were malicious attackers

00:09:47.050 --> 00:09:49.567
we didn't even have to work
very hard to figure out

00:09:49.567 --> 00:09:51.102
how to fool the network.

00:09:51.102 --> 00:09:52.234
We just asked the network

00:09:52.234 --> 00:09:53.837
to give us an image of an airplane,

00:09:53.837 --> 00:09:56.516
and it gave us something
that fools it into thinking

00:09:56.516 --> 00:09:59.016
that the input is an airplane.

00:10:00.310 --> 00:10:02.727
When Christian first published this work,

00:10:02.727 --> 00:10:05.175
a lot of articles came
out with titles like,

00:10:05.175 --> 00:10:07.210
The Flaw Lurking in Every
Deep Neural Net,

00:10:07.210 --> 00:10:10.590
or Deep Learning has Deep Flaws.

00:10:10.590 --> 00:10:12.577
It's important to remember
that these vulnerabilities

00:10:12.577 --> 00:10:15.903
apply to essentially every
machine learning algorithm

00:10:15.903 --> 00:10:18.625
that we've studied so far.

00:10:18.625 --> 00:10:20.458
Some of them like RBF networks

00:10:20.458 --> 00:10:22.906
and Parzen density estimators

00:10:22.906 --> 00:10:24.942
are able to resist this effect somewhat,

00:10:24.942 --> 00:10:27.908
but even very simple
machine learning algorithms

00:10:27.908 --> 00:10:32.069
are highly vulnerable
to adversarial examples.

00:10:32.069 --> 00:10:33.870
In this image, I show an animation

00:10:33.870 --> 00:10:37.038
of what happens when we
attack a linear model,

00:10:37.038 --> 00:10:38.890
so it's not a deep algorithm at all.

00:10:38.890 --> 00:10:41.370
It's just a shallow softmax model.

00:10:41.370 --> 00:10:45.440
You multiply by a matrix, you
add a vector of bias terms,

00:10:45.440 --> 00:10:47.223
you apply the softmax function,

00:10:47.223 --> 00:10:48.846
and you've got your
probability distribution

00:10:48.846 --> 00:10:51.249
over the 10 MNIST classes.
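That shallow softmax model can be written out in a few lines of NumPy; random weights stand in here for the trained MNIST parameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
W = rng.normal(size=(10, 784)) * 0.01  # one weight row per MNIST class
b = np.zeros(10)                       # bias terms
x = rng.random(784)                    # stand-in for a flattened 28x28 digit

p = softmax(W @ x + b)                 # distribution over the 10 classes

# For a linear model, the gradient of class k's logit with respect to
# the input is just row k of W -- constant everywhere in input space,
# which is why one fixed direction per class suffices for the attack.
grad_logit_0 = W[0]
```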

00:10:51.249 --> 00:10:54.022
At the upper left, I start
with an image of a nine,

00:10:54.022 --> 00:10:57.161
and then as we move left
to right, top to bottom,

00:10:57.161 --> 00:11:00.141
I gradually transform it to be a zero.

00:11:00.141 --> 00:11:02.053
Where I've drawn the yellow box,

00:11:02.053 --> 00:11:05.640
the model assigns high
probability to it being a zero.

00:11:05.640 --> 00:11:08.323
I forget exactly what my threshold
was for high probability,

00:11:08.323 --> 00:11:11.856
but I think it was around 0.9 or so.

00:11:11.856 --> 00:11:13.503
Then as we move to the second row,

00:11:13.503 --> 00:11:15.462
I transform it into a one,

00:11:15.462 --> 00:11:17.136
and the second yellow box indicates

00:11:17.136 --> 00:11:18.932
where we've successfully fooled the model

00:11:18.932 --> 00:11:21.663
into thinking it's a one
with high probability.

00:11:21.663 --> 00:11:23.878
And then as you read the
rest of the yellow boxes

00:11:23.878 --> 00:11:25.250
left to right, top to bottom,

00:11:25.250 --> 00:11:27.691
we go through the twos,
threes, fours, and so on,

00:11:27.691 --> 00:11:29.646
until finally at the lower right

00:11:29.646 --> 00:11:31.855
we have a nine that has
a yellow box around it,

00:11:31.855 --> 00:11:33.794
and it actually looks like a nine,

00:11:33.794 --> 00:11:35.001
but in this case the only reason

00:11:35.001 --> 00:11:36.185
it actually looks like a nine

00:11:36.185 --> 00:11:39.369
is that we started the
whole process with a nine.

00:11:39.369 --> 00:11:43.042
We successfully swept through
all 10 classes of MNIST

00:11:43.042 --> 00:11:46.892
without substantially changing
the image of the digit

00:11:46.892 --> 00:11:50.578
in any way that would interfere
with human recognition.

00:11:50.578 --> 00:11:54.745
This linear model was actually
extremely easy to fool.

00:11:55.879 --> 00:11:57.791
Besides that softmax model, we've also seen

00:11:57.791 --> 00:12:01.480
that we can fool many different
kinds of linear models

00:12:01.480 --> 00:12:04.588
including logistic regression and SVMs.

00:12:04.588 --> 00:12:07.118
We've also found that we
can fool decision trees,

00:12:07.118 --> 00:12:11.285
and to a lesser extent,
nearest neighbors classifiers.

00:12:13.049 --> 00:12:16.605
We wanted to explain
exactly why this happens.

00:12:16.605 --> 00:12:20.122
Back in about 2014, after we'd
published the original paper

00:12:20.122 --> 00:12:22.934
where we'd said that these problems exist,

00:12:22.934 --> 00:12:25.929
we were trying to figure
out why they happen.

00:12:25.929 --> 00:12:27.394
When we wrote our first paper,

00:12:27.394 --> 00:12:30.517
we thought that basically
this is a form of overfitting,

00:12:30.517 --> 00:12:34.087
that you have a very
complicated deep neural network,

00:12:34.087 --> 00:12:36.086
it learns to fit the training set,

00:12:36.086 --> 00:12:39.604
its behavior on the test
set is somewhat undefined,

00:12:39.604 --> 00:12:41.858
and then it makes random mistakes

00:12:41.858 --> 00:12:44.023
that an attacker can exploit.

00:12:44.023 --> 00:12:45.778
Let's walk through what
that story looks like

00:12:45.778 --> 00:12:47.650
somewhat concretely.

00:12:47.650 --> 00:12:50.885
I have here a training
set of three blue X's

00:12:50.885 --> 00:12:53.105
and three green O's.

00:12:53.105 --> 00:12:54.364
We want to make a classifier

00:12:54.364 --> 00:12:57.435
that can recognize X's and recognize O's.

00:12:57.435 --> 00:12:59.806
We have a very complicated classifier

00:12:59.806 --> 00:13:01.972
that can easily fit the training set,

00:13:01.972 --> 00:13:03.633
so we represent everywhere it believes

00:13:03.633 --> 00:13:06.486
X's should be with blobs of blue color,

00:13:06.486 --> 00:13:08.369
and I've drawn a blob of blue

00:13:08.369 --> 00:13:10.629
around all of the training set X's,

00:13:10.629 --> 00:13:13.157
so it correctly classifies
the training set.

00:13:13.157 --> 00:13:17.840
It also has a blob of green
mass showing where the O's are,

00:13:17.840 --> 00:13:21.360
and it successfully fits all
of the green training set O's,

00:13:21.360 --> 00:13:24.482
but then because this is a
very complicated function

00:13:24.482 --> 00:13:26.850
and it has just way more parameters

00:13:26.850 --> 00:13:29.998
than it actually needs to
represent the training task,

00:13:29.998 --> 00:13:33.168
it throws little blobs of probability mass

00:13:33.168 --> 00:13:35.680
around the rest of space randomly.

00:13:35.680 --> 00:13:37.566
On the left there's a blob of green space

00:13:37.566 --> 00:13:40.121
that's kind of near the training set X's,

00:13:40.121 --> 00:13:42.032
and I've drawn a red X there to show

00:13:42.032 --> 00:13:43.740
that maybe this would be
an adversarial example

00:13:43.740 --> 00:13:46.441
where we expect the
classification to be X,

00:13:46.441 --> 00:13:48.570
but the model assigns O.

00:13:48.570 --> 00:13:51.663
On the right, I've shown
that there's a red O

00:13:51.663 --> 00:13:53.826
where we have another adversarial example.

00:13:53.826 --> 00:13:55.655
We're very near the other O's.

00:13:55.655 --> 00:13:58.175
We might expect the model to
assign this class to be an O,

00:13:58.175 --> 00:14:00.375
and yet because it's drawn blue mass there

00:14:00.375 --> 00:14:04.060
it's actually assigning it to be an X.

00:14:04.060 --> 00:14:05.614
If overfitting is really the story

00:14:05.614 --> 00:14:09.105
then each adversarial
example is more or less

00:14:09.105 --> 00:14:12.877
the result of bad luck and
also more or less unique.

00:14:12.877 --> 00:14:14.455
If we fit the model again

00:14:14.455 --> 00:14:16.378
or we fit a slightly different model

00:14:16.378 --> 00:14:19.137
we would expect to make
different random mistakes

00:14:19.137 --> 00:14:22.338
on these points that are
off the training set,

00:14:22.338 --> 00:14:25.131
but that was actually
not what we found at all.

00:14:25.131 --> 00:14:28.017
We found that many different
models would misclassify

00:14:28.017 --> 00:14:30.533
the same adversarial examples,

00:14:30.533 --> 00:14:33.271
and they would assign
the same class to them.

00:14:33.271 --> 00:14:36.191
We also found that if
we took the difference

00:14:36.191 --> 00:14:40.429
between an original example
and an adversarial example

00:14:40.429 --> 00:14:43.226
then we had a direction in input space

00:14:43.226 --> 00:14:46.719
and we could add that same offset vector

00:14:46.719 --> 00:14:49.234
to any clean example, and
we would almost always

00:14:49.234 --> 00:14:52.067
get an adversarial example as a result.
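For a linear scorer this offset-vector observation is exact: the same delta shifts every example's score by the same amount, no matter which clean example it is added to. A small synthetic illustration follows; the weights and examples are random, not from a real model:

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=100)          # weights of a linear scorer: score = w.x
xs = rng.normal(size=(5, 100))    # five unrelated clean examples

# The offset is computed once, e.g. as (adversarial - original); for a
# linear model the worst-case direction is the sign-gradient direction.
delta = -0.5 * np.sign(w)

# Adding the SAME delta to every clean example changes every score by
# exactly -0.5 * sum(|w|), independent of the example.
shifts = np.array([w @ (x + delta) - w @ x for x in xs])
```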

00:14:52.067 --> 00:14:52.935
So we started to realize

00:14:52.935 --> 00:14:55.283
that there was a systematic
effect going on here,

00:14:55.283 --> 00:14:57.842
not just a random effect.

00:14:57.842 --> 00:14:59.368
That led us to another idea

00:14:59.368 --> 00:15:01.317
which is that adversarial examples

00:15:01.317 --> 00:15:03.537
might actually be more like underfitting

00:15:03.537 --> 00:15:05.538
rather than overfitting.

00:15:05.538 --> 00:15:09.141
They might actually come from
the model being too linear.

00:15:09.141 --> 00:15:11.267
Here I draw the same task again

00:15:11.267 --> 00:15:13.655
where we have the same manifold of O's

00:15:13.655 --> 00:15:15.929
and the same line of X's,

00:15:15.929 --> 00:15:19.205
and this time I fit a
linear model to the data set

00:15:19.205 --> 00:15:23.772
rather than fitting a high
capacity, non-linear model to it.

00:15:23.772 --> 00:15:26.103
We see that we get a dividing hyperplane

00:15:26.103 --> 00:15:29.082
running in between the two classes.

00:15:29.082 --> 00:15:30.877
This hyperplane doesn't really capture

00:15:30.877 --> 00:15:33.803
the true structure of the classes.

00:15:33.803 --> 00:15:37.167
The O's are clearly arranged
in a C-shaped manifold.

00:15:37.167 --> 00:15:40.310
If we keep walking past
the end of the O's,

00:15:40.310 --> 00:15:43.734
we've crossed the decision
boundary and we've drawn a red O

00:15:43.734 --> 00:15:46.432
where even though we're very
near the decision boundary

00:15:46.432 --> 00:15:49.688
and near other O's we
believe that it is now an X.

00:15:49.688 --> 00:15:53.036
Similarly we can take
steps that go from near X's

00:15:53.036 --> 00:15:57.646
to just over the line that
are classified as O's.

00:15:57.646 --> 00:15:59.638
Another thing that's somewhat
unusual about this plot

00:15:59.638 --> 00:16:03.208
is that if we look at the lower
left or upper right corners

00:16:03.208 --> 00:16:05.428
these corners are very
confidently classified

00:16:05.428 --> 00:16:09.538
as being X's on the lower
left or O's on the upper right

00:16:09.538 --> 00:16:12.498
even though we've never seen
any data over there at all.

00:16:12.498 --> 00:16:14.710
The linear model family forces the model

00:16:14.710 --> 00:16:17.604
to have very high
confidence in these regions

00:16:17.604 --> 00:16:21.354
that are very far from
the decision boundary.

00:16:22.757 --> 00:16:25.923
We've seen that linear
models can actually assign

00:16:25.923 --> 00:16:28.478
really unusual confidence
as you move very far

00:16:28.478 --> 00:16:30.016
from the decision boundary,

00:16:30.016 --> 00:16:31.828
even if there isn't any data there,

00:16:31.828 --> 00:16:34.106
but are deep neural networks actually

00:16:34.106 --> 00:16:36.326
anything like linear models?

00:16:36.326 --> 00:16:38.598
Could linear models
actually explain anything

00:16:38.598 --> 00:16:41.190
about how it is that
deep neural nets fail?

00:16:41.190 --> 00:16:43.114
It turns out that modern deep neural nets

00:16:43.114 --> 00:16:45.482
are actually very piecewise linear,

00:16:45.482 --> 00:16:47.648
so rather than being a
single linear function

00:16:47.648 --> 00:16:49.162
they are piecewise linear

00:16:49.162 --> 00:16:52.412
with maybe not that many linear pieces.

00:16:53.588 --> 00:16:55.378
If we use rectified linear units

00:16:55.378 --> 00:16:59.545
then the mapping from the input
image to the output logits

00:17:00.460 --> 00:17:03.662
is literally a piecewise linear function.

00:17:03.662 --> 00:17:06.750
By the logits I mean the
un-normalized log probabilities

00:17:06.750 --> 00:17:11.701
before we apply the softmax
op at the output of the model.

00:17:11.701 --> 00:17:13.161
There are other neural networks

00:17:13.161 --> 00:17:14.955
like maxout networks that are also

00:17:14.955 --> 00:17:17.145
literally piecewise linear.

00:17:17.146 --> 00:17:19.915
And then there are several others
that come very close to it.

00:17:19.915 --> 00:17:22.627
Before rectified linear
units became popular

00:17:22.627 --> 00:17:27.019
most people used to use sigmoid
units of one form or another

00:17:27.019 --> 00:17:30.369
either logistic sigmoid or
hyperbolic tangent units.

00:17:30.369 --> 00:17:33.624
These sigmoidal units have
to be carefully tuned,

00:17:33.624 --> 00:17:35.715
especially at initialization

00:17:35.715 --> 00:17:37.936
so that you spend most of your time

00:17:37.936 --> 00:17:40.396
near the center of the sigmoid

00:17:40.396 --> 00:17:43.527
where the sigmoid is approximately linear.

00:17:43.527 --> 00:17:46.578
Then finally, the LSTM, a
kind of recurrent network

00:17:46.578 --> 00:17:49.641
that is one of the most popular
recurrent networks today,

00:17:49.641 --> 00:17:52.769
uses addition from one
time step to the next

00:17:52.769 --> 00:17:56.859
in order to accumulate and
remember information over time.

00:17:56.859 --> 00:18:00.021
Addition is a particularly
simple form of linearity,

00:18:00.021 --> 00:18:01.501
so we can see that the interaction

00:18:01.501 --> 00:18:06.055
between a very distant time step
in the past and the present

00:18:06.055 --> 00:18:09.330
is highly linear within an LSTM.

00:18:09.330 --> 00:18:11.647
Now to be clear, I'm
speaking about the mapping

00:18:11.647 --> 00:18:14.417
from the input of the model
to the output of the model.

00:18:14.417 --> 00:18:17.155
That's what I'm saying
is close to being linear

00:18:17.155 --> 00:18:21.128
or is piecewise linear
with relatively few pieces.

00:18:21.128 --> 00:18:23.351
The mapping from the
parameters of the network

00:18:23.351 --> 00:18:26.125
to the output of the network is non-linear

00:18:26.125 --> 00:18:29.345
because the weight matrices
at each layer of the network

00:18:29.345 --> 00:18:31.394
are multiplied together.

00:18:31.394 --> 00:18:34.249
So we actually get extremely
non-linear interactions

00:18:34.249 --> 00:18:36.434
between parameters and the output.

00:18:36.434 --> 00:18:39.348
That's what makes training a
neural network so difficult.

00:18:39.348 --> 00:18:42.315
But the mapping from
the input to the output

00:18:42.315 --> 00:18:45.177
is much more linear and predictable,

00:18:45.177 --> 00:18:47.347
and it means that optimization problems

00:18:47.347 --> 00:18:50.938
that aim to optimize
the input to the model

00:18:50.938 --> 00:18:53.600
are much easier than optimization problems

00:18:53.600 --> 00:18:57.169
that aim to optimize the parameters.

00:18:57.169 --> 00:18:59.631
If we go and look for
this happening in practice

00:18:59.631 --> 00:19:01.870
we can take a convolutional network

00:19:01.870 --> 00:19:04.273
and trace out a one-dimensional path

00:19:04.273 --> 00:19:07.013
through its input space.

00:19:07.013 --> 00:19:09.818
So what we're doing here is
we're choosing a clean example.

00:19:09.818 --> 00:19:12.763
It's an image of a white
car on a red background,

00:19:12.763 --> 00:19:14.856
and we are choosing a direction

00:19:14.856 --> 00:19:16.623
along which we'll travel through input space.

00:19:16.623 --> 00:19:19.403
We are going to have a coefficient epsilon

00:19:19.403 --> 00:19:21.273
that we multiply by this direction.

00:19:21.273 --> 00:19:22.848
When epsilon is negative 30,

00:19:22.848 --> 00:19:24.544
like at the left end of the plot,

00:19:24.544 --> 00:19:28.266
we're subtracting off a lot
of this unit vector direction.

00:19:28.266 --> 00:19:30.945
When epsilon is zero, like
in the middle of the plot,

00:19:30.945 --> 00:19:33.964
we're visiting the original
image from the data set,

00:19:33.964 --> 00:19:36.074
and when epsilon is positive 30,

00:19:36.074 --> 00:19:37.645
like at the right end of the plot,

00:19:37.645 --> 00:19:41.228
we're adding this
direction onto the input.
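
The epsilon sweep described here can be sketched with a toy linear model. Everything below (the weights, the "clean image", the direction) is a hypothetical stand-in, not the network from the lecture:

```python
import numpy as np

# Toy stand-in: a purely linear 3-class "model" with logits(x) = W @ x.
# W, x_clean, and direction are hypothetical illustrative values.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))             # 3 classes, 8 input features
x_clean = rng.normal(size=8)            # stand-in for the clean image
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)  # unit vector direction

# Sweep the coefficient epsilon from -30 to +30, as in the plot.
epsilons = np.linspace(-30, 30, 61)
logits = np.array([W @ (x_clean + eps * direction) for eps in epsilons])
```

For this linear stand-in every class's logit traces a perfectly straight line in epsilon; a ReLU network would instead trace a piecewise linear curve, and the question the plot asks is how many pieces that curve has along this one cross section.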

00:19:42.622 --> 00:19:45.079
In the panel on the left,
I show you an animation

00:19:45.079 --> 00:19:47.666
where we move from
epsilon equals negative 30

00:19:47.666 --> 00:19:50.820
up to epsilon equals positive 30.

00:19:50.820 --> 00:19:53.581
You read the animation left
to right, top to bottom,

00:19:53.581 --> 00:19:56.031
and everywhere that there's a yellow box

00:19:56.031 --> 00:20:00.198
the input is correctly
recognized as being a car.

00:20:01.379 --> 00:20:04.354
On the upper left, you see
that it looks mostly blue.

00:20:04.354 --> 00:20:07.817
On the lower right, it's
hard to tell what's going on.

00:20:07.817 --> 00:20:10.381
It's kind of reddish and so on.

00:20:10.381 --> 00:20:13.772
In the middle row, just after
where the yellow boxes end

00:20:13.772 --> 00:20:14.995
you can see pretty clearly

00:20:14.995 --> 00:20:17.324
that it's a car on a red background,

00:20:17.324 --> 00:20:20.747
though the image is small on these slides.

00:20:20.747 --> 00:20:23.780
What's interesting to
look at here is the logits

00:20:23.780 --> 00:20:25.168
that the model outputs.

00:20:25.168 --> 00:20:30.115
This is a deep convolutional
rectified linear unit network.

00:20:30.115 --> 00:20:32.326
Because it uses rectified linear units,

00:20:32.326 --> 00:20:36.160
we know that the output is
a piecewise linear function

00:20:36.160 --> 00:20:38.559
of the input to the model.

00:20:38.559 --> 00:20:40.835
The main question we're
asking by making this plot

00:20:40.835 --> 00:20:42.820
is how many different pieces

00:20:42.820 --> 00:20:45.628
does this piecewise linear function have

00:20:45.628 --> 00:20:48.552
if we look at one
particular cross section.

00:20:48.552 --> 00:20:50.835
You might think that maybe a deep net

00:20:50.835 --> 00:20:52.135
is going to represent some extremely

00:20:52.135 --> 00:20:54.749
wiggly complicated
function with lots and lots

00:20:54.749 --> 00:20:58.326
of linear pieces no matter
which cross section you look in.

00:20:58.326 --> 00:21:01.408
Or we might find that it
has more or less two pieces

00:21:01.408 --> 00:21:03.825
for each function we look at.

00:21:04.667 --> 00:21:07.201
Each of the different curves on this plot

00:21:07.201 --> 00:21:10.245
is the logits for a different class.

00:21:10.245 --> 00:21:13.864
We see that out at the tails of the plot

00:21:13.864 --> 00:21:16.528
that the frog class is the most likely,

00:21:16.528 --> 00:21:18.846
and the frog class basically looks like

00:21:18.846 --> 00:21:20.846
a big v-shaped function.

00:21:21.928 --> 00:21:24.193
The logits for the frog
class become very high

00:21:24.193 --> 00:21:27.270
when epsilon is negative
30 or positive 30,

00:21:27.270 --> 00:21:29.253
and they drop down and
become a little bit negative

00:21:29.253 --> 00:21:31.003
when epsilon is zero.

00:21:32.833 --> 00:21:36.250
The car class, listed as automobile here,

00:21:37.764 --> 00:21:39.856
is actually high in the middle,

00:21:39.856 --> 00:21:42.950
and the car is correctly recognized.

00:21:42.950 --> 00:21:44.944
As we sweep out to very negative epsilon,

00:21:44.944 --> 00:21:47.397
the logits for the car class do increase,

00:21:47.397 --> 00:21:49.033
but they don't increase nearly as quickly

00:21:49.033 --> 00:21:51.553
as the logits for the frog class.

00:21:51.553 --> 00:21:52.811
So, we've found a direction

00:21:52.811 --> 00:21:54.793
that's associated with the frog class

00:21:54.793 --> 00:21:59.041
and as we follow it out to a
relatively large perturbation,

00:21:59.041 --> 00:22:02.334
we find that the model
extrapolates linearly

00:22:02.334 --> 00:22:04.873
and begins to make a very
unreasonable prediction

00:22:04.873 --> 00:22:07.984
that the frog class is extremely likely

00:22:07.984 --> 00:22:09.971
just because we've moved a long way

00:22:09.971 --> 00:22:12.073
in this direction that
was locally associated

00:22:12.073 --> 00:22:15.240
with the frog class being more likely.

00:22:17.550 --> 00:22:20.694
When we actually go and
construct adversarial examples,

00:22:20.694 --> 00:22:23.200
we need to remember that we're able to get

00:22:23.200 --> 00:22:24.784
quite a large perturbation

00:22:24.784 --> 00:22:26.829
without changing the image very much

00:22:26.829 --> 00:22:29.912
as far as a human being is concerned.

00:22:30.882 --> 00:22:33.852
So here I show you a
handwritten digit three,

00:22:33.852 --> 00:22:36.395
and I'm going to change it
in several different ways,

00:22:36.395 --> 00:22:37.923
and all of these changes have

00:22:37.923 --> 00:22:40.806
the same L2 norm perturbation.

00:22:40.806 --> 00:22:44.421
In the top row, I'm going to
change the three into a seven

00:22:44.421 --> 00:22:47.752
just by looking for the nearest
seven in the training set.

00:22:47.752 --> 00:22:49.518
The difference between those two

00:22:49.518 --> 00:22:53.527
is this image that looks a
little bit like the seven

00:22:53.527 --> 00:22:55.187
wrapped in some black lines.

00:22:55.187 --> 00:22:57.813
So here, in the middle image

00:22:57.813 --> 00:22:59.808
in the perturbation column,

00:22:59.808 --> 00:23:02.184
the white pixels
represent adding something

00:23:02.184 --> 00:23:04.830
and black pixels represent
subtracting something

00:23:04.830 --> 00:23:08.142
as you move from the left
column to the right column.

00:23:08.142 --> 00:23:11.401
So when we take the three and
we apply this perturbation

00:23:11.401 --> 00:23:13.417
that transforms it into a seven,

00:23:13.417 --> 00:23:16.531
we can measure the L2
norm of that perturbation.

00:23:16.531 --> 00:23:20.236
And it turns out to
have an L2 norm of 3.96.

00:23:20.236 --> 00:23:21.818
That gives you kind of a reference

00:23:21.818 --> 00:23:24.790
for how big these perturbations can be.

00:23:24.790 --> 00:23:26.521
In the middle row, we apply a perturbation

00:23:26.521 --> 00:23:28.302
of exactly the same size,

00:23:28.302 --> 00:23:30.500
but with the direction chosen randomly.

00:23:30.500 --> 00:23:32.065
In this case we don't actually change

00:23:32.065 --> 00:23:33.720
the class of the three at all,

00:23:33.720 --> 00:23:35.377
we just get some random noise

00:23:35.377 --> 00:23:37.825
that didn't really change the class.

00:23:37.825 --> 00:23:41.373
A human could still easily
read it as being a three.

00:23:41.373 --> 00:23:44.285
And then finally at the very bottom row,

00:23:44.285 --> 00:23:46.230
we take the three and we
just erase a piece of it

00:23:46.230 --> 00:23:48.011
with a perturbation of the same norm

00:23:48.011 --> 00:23:50.334
and we turn it into something

00:23:50.334 --> 00:23:52.422
that doesn't have any class at all.

00:23:52.422 --> 00:23:53.714
It's not a three, it's not a seven,

00:23:53.714 --> 00:23:56.254
it's just a defective input.

00:23:56.254 --> 00:23:57.568
All of these changes can happen

00:23:57.568 --> 00:24:00.664
with the same L2 norm perturbation.

00:24:00.664 --> 00:24:03.025
And actually a lot of the time
with adversarial examples,

00:24:03.025 --> 00:24:06.011
you make perturbations that
have an even larger L2 norm.

00:24:06.011 --> 00:24:07.216
What's going on is that

00:24:07.216 --> 00:24:09.143
there are many different
pixels in the image,

00:24:09.143 --> 00:24:12.131
and so small changes to individual pixels

00:24:12.131 --> 00:24:15.227
can add up to relatively large vectors.

00:24:15.227 --> 00:24:17.566
For larger datasets like ImageNet,

00:24:17.566 --> 00:24:18.990
where there's even more pixels,

00:24:18.990 --> 00:24:21.184
you can make very small
changes to each pixel

00:24:21.184 --> 00:24:24.174
that travel very far in vector space

00:24:24.174 --> 00:24:26.368
as measured by the L2 norm.

00:24:26.368 --> 00:24:28.505
That means that you can
actually make changes

00:24:28.505 --> 00:24:30.093
that are almost imperceptible

00:24:30.093 --> 00:24:31.605
but actually move you really far

00:24:31.605 --> 00:24:33.477
and get a large dot product

00:24:33.477 --> 00:24:36.137
with the coefficients
of the linear function

00:24:36.137 --> 00:24:38.695
that the model represents.

00:24:38.695 --> 00:24:39.832
It also means that when

00:24:39.832 --> 00:24:41.467
we're constructing adversarial examples,

00:24:41.467 --> 00:24:44.838
we need to make sure that the
adversarial example procedure

00:24:44.838 --> 00:24:46.022
isn't able to do what happened

00:24:46.022 --> 00:24:48.240
in the top row of this slide here.

00:24:48.240 --> 00:24:49.627
So in the top row of this slide,

00:24:49.627 --> 00:24:50.756
we took the three and we actually

00:24:50.756 --> 00:24:52.454
just changed it into a seven.

00:24:52.454 --> 00:24:53.856
So when the model says that the image

00:24:53.856 --> 00:24:56.232
in the upper right is a
seven, it's not a mistake.

00:24:56.232 --> 00:24:59.145
We actually just changed the input class.

00:24:59.145 --> 00:25:00.499
When we build adversarial examples,

00:25:00.499 --> 00:25:02.928
we want to make sure that
we're measuring real mistakes.

00:25:02.928 --> 00:25:04.459
If we're experimenters studying

00:25:04.459 --> 00:25:06.259
how easy a network is to fool,

00:25:06.259 --> 00:25:08.146
we want to make sure that
we're actually fooling it

00:25:08.146 --> 00:25:11.515
and not just changing the input class.

00:25:11.515 --> 00:25:13.535
And if we're an attacker, we
actually want to make sure

00:25:13.535 --> 00:25:17.457
that we're causing
misbehavior in the system.

00:25:17.457 --> 00:25:19.689
To do that, when we build
adversarial examples,

00:25:19.689 --> 00:25:24.134
we use the max norm to
constrain the perturbation.

00:25:24.134 --> 00:25:26.726
Basically this says
that no pixel can change

00:25:26.726 --> 00:25:28.812
by more than some amount epsilon.

00:25:28.812 --> 00:25:30.991
So the L2 norm can get really big,

00:25:30.991 --> 00:25:33.335
but you can't concentrate all the changes

00:25:33.335 --> 00:25:35.908
for that L2 norm to erase
pieces of the digit,

00:25:35.908 --> 00:25:39.701
like in the bottom row here
we erased the top of a three.

00:25:39.701 --> 00:25:42.604
One very fast way to build
an adversarial example

00:25:42.604 --> 00:25:45.503
is just to take the gradient of the cost

00:25:45.503 --> 00:25:47.140
that you used to train the network

00:25:47.140 --> 00:25:48.663
with respect to the input,

00:25:48.663 --> 00:25:51.312
and then take the sign of that gradient.

00:25:51.312 --> 00:25:55.708
The sign is essentially
enforcing the max norm constraint.

00:25:55.708 --> 00:25:58.550
You're only allowed to change the input by

00:25:58.550 --> 00:26:00.690
up to epsilon at each pixel,

00:26:00.690 --> 00:26:02.381
so if you just take the sign it tells you

00:26:02.381 --> 00:26:04.761
whether you want to add
epsilon or subtract epsilon

00:26:04.761 --> 00:26:07.010
in order to hurt the network.

00:26:07.010 --> 00:26:08.844
You can view this as
taking the observation

00:26:08.844 --> 00:26:10.790
that the network is more or less linear,

00:26:10.790 --> 00:26:12.211
as we showed on this slide,

00:26:12.211 --> 00:26:14.265
and using that to motivate

00:26:14.265 --> 00:26:17.918
building a first order
Taylor series approximation

00:26:17.918 --> 00:26:21.105
of the neural network's cost.

00:26:21.105 --> 00:26:24.508
And then subject to that
Taylor series approximation,

00:26:24.508 --> 00:26:26.106
we want to maximize the cost

00:26:26.106 --> 00:26:28.898
following this max norm constraint.

00:26:28.898 --> 00:26:30.590
And that gives us this
technique that we call

00:26:30.590 --> 00:26:32.785
the fast gradient sign method.

00:26:32.785 --> 00:26:34.350
If you want to just get your hands dirty

00:26:34.350 --> 00:26:36.835
and start making adversarial
examples really quickly,

00:26:36.835 --> 00:26:38.764
or if you have an algorithm
where you want to train

00:26:38.764 --> 00:26:41.534
on adversarial examples in
the inner loop of learning,

00:26:41.534 --> 00:26:43.402
this method will make
adversarial examples for you

00:26:43.402 --> 00:26:45.134
very, very quickly.
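
A minimal sketch of the fast gradient sign method for a hypothetical softmax regression model. In practice the gradient would come from your framework's automatic differentiation; the weights and the "image" here are random stand-ins:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def fgsm(x, y, W, b, epsilon):
    """Perturb x by epsilon * sign of the cross-entropy gradient wrt x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0                    # dL/dlogits for cross-entropy is p - onehot(y)
    grad_x = W.T @ p               # chain rule back to the input
    return x + epsilon * np.sign(grad_x)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(10, 784)), np.zeros(10)
x = rng.uniform(size=784)          # stand-in for an MNIST-sized image
x_adv = fgsm(x, y=3, W=W, b=b, epsilon=0.25)
```

Because the perturbation is epsilon times a sign vector, every pixel moves by exactly epsilon, which is what enforces the max norm constraint while maximizing the first order increase in the cost.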

00:26:45.134 --> 00:26:47.942
In practice you should
also use other methods,

00:26:47.942 --> 00:26:50.353
like Nicholas Carlini's attack based on

00:26:50.353 --> 00:26:52.660
multiple steps of the Adam optimizer,

00:26:52.660 --> 00:26:55.212
to make sure that you
have a very strong attack

00:26:55.212 --> 00:26:57.359
that you bring out when
you think you have a model

00:26:57.359 --> 00:26:59.678
that might be more powerful.

00:26:59.678 --> 00:27:02.145
A lot of the time people
find that they can defeat

00:27:02.145 --> 00:27:03.460
the fast gradient sign method

00:27:03.460 --> 00:27:05.740
and think that they've
built a successful defense,

00:27:05.740 --> 00:27:08.769
but then when you bring
out a more powerful method

00:27:08.769 --> 00:27:10.444
that takes longer to evaluate,

00:27:10.444 --> 00:27:12.566
they find that they can't overcome

00:27:12.566 --> 00:27:16.066
the more computationally expensive attack.

00:27:18.043 --> 00:27:20.090
I've told you that
adversarial examples happen

00:27:20.090 --> 00:27:22.036
because the model is very linear.

00:27:22.036 --> 00:27:23.529
And then I told you that we could

00:27:23.529 --> 00:27:25.132
use this linearity assumption

00:27:25.132 --> 00:27:28.694
to build this attack, the
fast gradient sign method.

00:27:28.694 --> 00:27:31.900
This method, when applied
to a regular neural network

00:27:31.900 --> 00:27:34.079
that doesn't have any special defenses,

00:27:34.079 --> 00:27:38.328
will get over a 99% attack success rate.

00:27:38.328 --> 00:27:40.377
So that seems to confirm, somewhat,

00:27:40.377 --> 00:27:42.936
this hypothesis that adversarial examples

00:27:42.936 --> 00:27:45.054
come from the model being far too linear

00:27:45.054 --> 00:27:48.964
and extrapolating in linear
fashions when it shouldn't.

00:27:48.964 --> 00:27:51.514
Well we can actually go
looking for some more evidence.

00:27:51.514 --> 00:27:54.417
My friend David Warde-Farley
and I built these maps

00:27:54.417 --> 00:27:57.172
of the decision boundaries
of neural networks.

00:27:57.172 --> 00:27:58.809
And we found that they are consistent

00:27:58.809 --> 00:28:02.140
with the linearity hypothesis.

00:28:02.140 --> 00:28:04.478
So the FGSM is that attack method

00:28:04.478 --> 00:28:06.244
that I described in the previous slide,

00:28:06.244 --> 00:28:08.260
where we take the sign of the gradient.

00:28:08.260 --> 00:28:09.537
We'd like to build a map

00:28:09.537 --> 00:28:13.353
of a two-dimensional cross
section of input space

00:28:13.353 --> 00:28:15.760
and show which classes are assigned

00:28:15.760 --> 00:28:18.556
to the data at each point.

00:28:18.556 --> 00:28:21.397
In the grid on the right,
each different cell,

00:28:21.397 --> 00:28:23.308
each little square within the grid,

00:28:23.308 --> 00:28:27.715
is a map of a CIFAR-10
classifier's decision boundary,

00:28:27.715 --> 00:28:29.932
with each cell
corresponding to a different

00:28:29.932 --> 00:28:32.668
CIFAR-10 testing sample.

00:28:32.668 --> 00:28:34.624
On the left I show you a little legend

00:28:34.624 --> 00:28:37.867
where you can understand
what each cell means.

00:28:37.867 --> 00:28:40.927
The very center of each
cell corresponds to

00:28:40.927 --> 00:28:43.338
the original example
from the CIFAR-10 dataset

00:28:43.338 --> 00:28:45.590
with no modification.

00:28:45.590 --> 00:28:47.534
As we move left to right in the cell,

00:28:47.534 --> 00:28:48.561
we're moving in the direction

00:28:48.561 --> 00:28:50.918
of the fast gradient sign method attack.

00:28:50.918 --> 00:28:53.076
So just the sign of the gradient.

00:28:53.076 --> 00:28:54.897
As we move up and down within the cell,

00:28:54.897 --> 00:28:58.243
we're moving in a random
direction that's orthogonal to

00:28:58.243 --> 00:29:00.907
the fast gradient sign method direction.

00:29:00.907 --> 00:29:04.204
So we get to see a cross
section, a 2D cross section

00:29:04.204 --> 00:29:06.454
of CIFAR-10 decision space.

00:29:07.455 --> 00:29:09.604
At each pixel within this map,

00:29:09.604 --> 00:29:13.291
we plot a color that tells us
which class is assigned there.

00:29:13.291 --> 00:29:15.199
We use white pixels to indicate that

00:29:15.199 --> 00:29:17.174
the correct class was chosen,

00:29:17.174 --> 00:29:19.538
and then we use different
colors to represent

00:29:19.538 --> 00:29:21.931
all of the other incorrect classes.

00:29:21.931 --> 00:29:23.908
You can see that in nearly all

00:29:23.908 --> 00:29:25.641
of the grid cells on the right,

00:29:25.641 --> 00:29:29.222
roughly the left half
of the image is white.

00:29:29.222 --> 00:29:31.564
So roughly the left half of the image

00:29:31.564 --> 00:29:33.648
has been correctly classified.

00:29:33.648 --> 00:29:36.761
As we move to the right, we
see that there is usually

00:29:36.761 --> 00:29:39.537
a different color on the right half.

00:29:39.537 --> 00:29:41.441
And the boundaries between these regions

00:29:41.441 --> 00:29:43.118
are approximately linear.

00:29:43.118 --> 00:29:45.153
What's going on here is that
the fast gradient sign method

00:29:45.153 --> 00:29:47.116
has identified a direction

00:29:47.116 --> 00:29:50.283
where if we get a large dot
product with that direction

00:29:50.283 --> 00:29:52.694
we can get an adversarial example.

00:29:52.694 --> 00:29:54.729
And from this we can see
that adversarial examples

00:29:54.729 --> 00:29:57.896
live more or less in linear subspaces.

00:29:59.299 --> 00:30:01.334
When we first discovered
adversarial examples,

00:30:01.334 --> 00:30:04.358
we thought that they might
live in little tiny pockets.

00:30:04.358 --> 00:30:06.643
In the first paper we
actually speculated that

00:30:06.643 --> 00:30:09.057
maybe they're a little bit
like the rational numbers,

00:30:09.057 --> 00:30:11.956
hiding out finely tiled
among the real numbers,

00:30:11.956 --> 00:30:15.862
with nearly every real number
being near a rational number.

00:30:15.862 --> 00:30:17.212
We thought that because
we were able to find

00:30:17.212 --> 00:30:18.940
an adversarial example corresponding

00:30:18.940 --> 00:30:22.147
to every clean example that
we loaded into the network.

00:30:22.147 --> 00:30:23.620
After doing this further analysis,

00:30:23.620 --> 00:30:27.216
we found that what's happening
is that every real example

00:30:27.216 --> 00:30:29.688
is near one of these
linear decision boundaries

00:30:29.688 --> 00:30:32.908
where you cross over into
an adversarial subspace.

00:30:32.908 --> 00:30:35.193
And once you're in that
adversarial subspace,

00:30:35.193 --> 00:30:38.738
all the other points nearby
are also adversarial examples

00:30:38.738 --> 00:30:40.790
that will be misclassified.

00:30:40.790 --> 00:30:42.412
This has security implications

00:30:42.412 --> 00:30:46.154
because it means you only need
to get the direction right.

00:30:46.154 --> 00:30:48.854
You don't need to find an
exact coordinate in space.

00:30:48.854 --> 00:30:50.640
You just need to find a direction

00:30:50.640 --> 00:30:54.382
that has a large dot product
with the sign of the gradient.

00:30:54.382 --> 00:30:56.308
And once you move
approximately

00:30:56.308 --> 00:30:59.808
in that direction, you can fool the model.

00:31:01.161 --> 00:31:02.726
We also made another cross section

00:31:02.726 --> 00:31:05.659
where after using the left-right axis

00:31:05.659 --> 00:31:07.564
as the fast gradient sign method,

00:31:07.564 --> 00:31:09.187
we looked for a second direction

00:31:09.187 --> 00:31:11.884
that has high dot
product with the gradient

00:31:11.884 --> 00:31:14.966
so we could make both axes adversarial.

00:31:14.966 --> 00:31:16.363
And in this case you see that we get

00:31:16.363 --> 00:31:18.038
linear decision boundaries.

00:31:18.038 --> 00:31:21.475
They're now oriented diagonally
rather than vertically,

00:31:21.475 --> 00:31:23.207
but you can see that there's actually

00:31:23.207 --> 00:31:24.609
this two-dimensional subspace

00:31:24.609 --> 00:31:29.217
of adversarial examples
that we can cross into.

00:31:29.217 --> 00:31:30.854
Finally it's important to remember

00:31:30.854 --> 00:31:33.158
that adversarial examples are not noise.

00:31:33.158 --> 00:31:35.284
You can add a lot of noise
to an adversarial example

00:31:35.284 --> 00:31:37.334
and it will stay adversarial.

00:31:37.334 --> 00:31:39.460
You can add a lot of
noise to a clean example

00:31:39.460 --> 00:31:40.877
and it will stay clean.

00:31:40.877 --> 00:31:42.355
Here we make random cross sections

00:31:42.355 --> 00:31:45.417
where both axes are
randomly chosen directions.

00:31:45.417 --> 00:31:47.177
And you see that on CIFAR-10,

00:31:47.177 --> 00:31:49.229
most of the cells are completely white,

00:31:49.229 --> 00:31:51.916
meaning that they're correctly
classified to start with,

00:31:51.916 --> 00:31:54.993
and when you add noise they
stay correctly classified.

00:31:54.993 --> 00:31:56.953
We also see that the
model makes some mistakes

00:31:56.953 --> 00:31:58.915
because this is the test set.

00:31:58.915 --> 00:32:01.651
And generally if a test example
starts out misclassified,

00:32:01.651 --> 00:32:03.724
adding the noise doesn't change it.

00:32:03.724 --> 00:32:05.861
There are a few exceptions where,

00:32:05.861 --> 00:32:08.889
if you look in the
third row, third column,

00:32:08.889 --> 00:32:12.633
noise actually can make the
model misclassify the example

00:32:12.633 --> 00:32:14.918
for especially large noise values.

00:32:14.918 --> 00:32:17.881
And there's even a case where,

00:32:17.881 --> 00:32:20.227
in the top row, you can see
one example where

00:32:20.227 --> 00:32:23.553
the model is misclassifying
the test example to start with

00:32:23.553 --> 00:32:26.745
but then noise can change it
to be correctly classified.

00:32:26.745 --> 00:32:28.742
For the most part, noise
has very little effect

00:32:28.742 --> 00:32:31.248
on the classification decision

00:32:31.248 --> 00:32:33.461
compared to adversarial examples.

00:32:33.461 --> 00:32:36.628
What's going on here is that
in high dimensional spaces,

00:32:36.628 --> 00:32:38.860
if you choose some reference vector

00:32:38.860 --> 00:32:41.194
and then you choose a random vector

00:32:41.194 --> 00:32:42.873
in that high dimensional space,

00:32:42.873 --> 00:32:45.321
the random vector will, on average,

00:32:45.321 --> 00:32:49.982
have zero dot product
with the reference vector.

00:32:49.982 --> 00:32:51.257
So, think about making

00:32:51.257 --> 00:32:54.497
a first order Taylor series
approximation of your cost,

00:32:54.497 --> 00:32:57.430
and about how that
Taylor series approximation

00:32:57.430 --> 00:33:00.852
predicts that random vectors
will change your cost.

00:33:00.852 --> 00:33:02.580
You see that random vectors on average

00:33:02.580 --> 00:33:04.793
have no effect on the cost.

00:33:04.793 --> 00:33:08.960
But adversarial examples
are chosen to maximize it.
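
This near-orthogonality of random directions is easy to check numerically. The sketch below uses a random vector as a stand-in for the cost gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                          # dimensionality (e.g. number of pixels)
grad = rng.normal(size=n)           # stand-in for the cost gradient

rand_dir = rng.normal(size=n)
rand_dir /= np.abs(rand_dir).max()  # scale so its max norm is 1
adv_dir = np.sign(grad)             # FGSM direction, max norm exactly 1

# First order change in cost predicted by a Taylor approximation:
rand_effect = abs(grad @ rand_dir)  # tiny: random vectors nearly orthogonal
adv_effect = grad @ adv_dir         # equals sum(|grad|), the maximum possible
```

Under the same max norm budget, the random direction's predicted effect on the cost is a small fraction of the adversarial direction's, and the gap widens as the dimensionality grows.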

00:33:10.246 --> 00:33:13.505
In these plots we looked
in two dimensions.

00:33:13.505 --> 00:33:16.260
More recently, Florian
Tramer here at Stanford

00:33:16.260 --> 00:33:17.720
got interested in finding out

00:33:17.720 --> 00:33:20.702
just how many dimensions
there are to these subspaces

00:33:20.702 --> 00:33:22.702
where the adversarial examples

00:33:22.702 --> 00:33:25.908
lie in a thick contiguous region.

00:33:25.908 --> 00:33:28.716
And we came up with an algorithm together

00:33:28.716 --> 00:33:30.513
where you actually look for

00:33:30.513 --> 00:33:32.259
several different orthogonal vectors

00:33:32.259 --> 00:33:35.878
that all have a large dot
product with the gradient.

00:33:35.878 --> 00:33:38.019
By looking in several different

00:33:38.019 --> 00:33:40.256
orthogonal directions simultaneously,

00:33:40.256 --> 00:33:42.684
we can map out this kind of polytope

00:33:42.684 --> 00:33:45.833
where many different
adversarial examples live.
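
As a rough sketch of the idea (not necessarily the exact construction from that work), one simple way to get several orthogonal directions that each keep a positive dot product with the gradient is to take the sign of the gradient on disjoint blocks of coordinates:

```python
import numpy as np

def orthogonal_adversarial_directions(grad, k):
    """Return k mutually orthogonal directions, each correlated with grad.

    Each direction is the sign of the gradient on one block of
    coordinates and zero elsewhere, so supports are disjoint.
    """
    dirs = np.zeros((k, grad.size))
    for i, block in enumerate(np.array_split(np.arange(grad.size), k)):
        dirs[i, block] = np.sign(grad[block])
    return dirs

rng = np.random.default_rng(0)
grad = rng.normal(size=784)         # stand-in for the cost gradient
dirs = orthogonal_adversarial_directions(grad, k=25)

gram = dirs @ dirs.T                # off-diagonal entries are all zero
dots = dirs @ grad                  # every direction has positive dot product
```

Each direction only captures part of the gradient's mass, so the more orthogonal directions you demand, the weaker each one's alignment with the gradient becomes; measuring how many still cross the decision boundary is what gives the subspace its dimensionality.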

00:33:45.833 --> 00:33:47.974
We found out that this adversarial region

00:33:47.974 --> 00:33:51.592
has on average about 25 dimensions.

00:33:51.592 --> 00:33:53.389
If you look at different
examples you'll find

00:33:53.389 --> 00:33:56.043
different numbers of
adversarial dimensions.

00:33:56.043 --> 00:33:59.526
But on average on MNIST
we found it was about 25.

00:33:59.526 --> 00:34:02.181
So what's interesting
here is the dimensionality

00:34:02.181 --> 00:34:04.137
actually tells you something about

00:34:04.137 --> 00:34:06.782
how likely you are to find
an adversarial example

00:34:06.782 --> 00:34:09.350
by generating random noise.

00:34:09.350 --> 00:34:12.288
If every direction were adversarial,

00:34:12.288 --> 00:34:15.657
then any change would
cause a misclassification.

00:34:15.657 --> 00:34:17.692
If most of the directions
were adversarial,

00:34:17.692 --> 00:34:20.443
then random directions would
end up being adversarial

00:34:20.443 --> 00:34:22.731
just by accident most of the time.

00:34:22.731 --> 00:34:25.879
And then if there was only
one adversarial direction,

00:34:25.879 --> 00:34:28.237
you'd almost never find that direction

00:34:28.237 --> 00:34:30.219
just by adding random noise.

00:34:30.219 --> 00:34:34.088
When there's 25 you have a
chance of doing it sometimes.

00:34:34.089 --> 00:34:36.321
Another interesting thing
is that different models

00:34:36.321 --> 00:34:39.724
will often misclassify the
same adversarial examples.

00:34:39.724 --> 00:34:43.592
The dimensionality
of the adversarial subspace

00:34:43.592 --> 00:34:46.275
relates to that transfer property.

00:34:46.275 --> 00:34:48.992
The larger the dimensionality
of the subspace,

00:34:48.993 --> 00:34:50.505
the more likely it is that the subspaces

00:34:50.505 --> 00:34:52.929
for two models will intersect.

00:34:52.929 --> 00:34:55.237
So if you have two different models

00:34:55.237 --> 00:34:57.220
that have a very large
adversarial subspace,

00:34:57.220 --> 00:34:58.742
you know that you can probably transfer

00:34:58.742 --> 00:35:01.161
adversarial examples
from one to the other.

00:35:01.161 --> 00:35:03.609
But if the adversarial
subspace is very small,

00:35:03.609 --> 00:35:06.796
then unless there's some kind
of really systematic effect

00:35:06.796 --> 00:35:09.603
forcing them to share
exactly the same subspace,

00:35:09.603 --> 00:35:11.548
it seems less likely that
you'll be able to transfer

00:35:11.548 --> 00:35:15.715
examples just due to the
subspaces randomly aligning.

00:35:17.716 --> 00:35:20.563
A lot of the time in
the adversarial example

00:35:20.563 --> 00:35:21.786
research community,

00:35:21.786 --> 00:35:25.080
we refer back to the story of Clever Hans.

00:35:25.080 --> 00:35:28.176
This comes from an essay
by Bob Sturm called

00:35:28.176 --> 00:35:30.408
Clever Hans, Clever Algorithms.

00:35:30.408 --> 00:35:32.764
Because Clever Hans is
a pretty good metaphor

00:35:32.764 --> 00:35:35.679
for what's happening with
machine learning algorithms.

00:35:35.679 --> 00:35:39.446
So Clever Hans was a horse
that lived in the early 1900s.

00:35:39.446 --> 00:35:43.171
His owner trained him to
do arithmetic problems.

00:35:43.171 --> 00:35:45.494
So you could ask him, "Clever Hans,

00:35:45.494 --> 00:35:47.092
"what's two plus one?"

00:35:47.092 --> 00:35:50.425
And he would answer by tapping his hoof.

00:35:52.566 --> 00:35:54.873
And after the third tap,
everybody would start

00:35:54.873 --> 00:35:56.976
cheering and clapping and looking excited

00:35:56.976 --> 00:35:59.958
because he'd actually done
an arithmetic problem.

00:35:59.958 --> 00:36:01.151
Well it turned out that

00:36:01.151 --> 00:36:03.254
he hadn't actually
learned to do arithmetic.

00:36:03.254 --> 00:36:05.256
But it was actually
pretty hard to figure out

00:36:05.256 --> 00:36:06.638
what was going on.

00:36:06.638 --> 00:36:10.924
His owner was not trying
to defraud anybody,

00:36:10.924 --> 00:36:13.588
his owner actually believed
he could do arithmetic.

00:36:13.588 --> 00:36:15.782
And presumably Clever Hans himself

00:36:15.782 --> 00:36:18.067
was not trying to trick anybody.

00:36:18.067 --> 00:36:20.390
But eventually a psychologist examined him

00:36:20.390 --> 00:36:23.832
and found that if he
was put in a room alone

00:36:23.832 --> 00:36:25.358
without an audience,

00:36:25.358 --> 00:36:29.137
and the person asking the
questions wore a mask,

00:36:29.137 --> 00:36:31.156
he couldn't figure out
when to stop tapping.

00:36:31.156 --> 00:36:32.505
You'd ask him, "Clever Hans,

00:36:32.505 --> 00:36:33.994
"what's one plus one?"

00:36:33.994 --> 00:36:37.411
And he'd just [knocking]

00:36:38.642 --> 00:36:40.084
keep staring at your face, waiting for you

00:36:40.084 --> 00:36:42.710
to give him some sign
that he was done tapping.

00:36:42.710 --> 00:36:44.784
So everybody in this situation

00:36:44.784 --> 00:36:46.975
was trying to do the right thing.

00:36:46.975 --> 00:36:48.776
Clever Hans was trying
to do whatever it took

00:36:48.776 --> 00:36:51.478
to get the apple that
his owner would give him

00:36:51.478 --> 00:36:53.275
when he answered an arithmetic problem.

00:36:53.275 --> 00:36:56.155
His owner did his best
to train him correctly

00:36:56.155 --> 00:36:57.861
with real arithmetic questions

00:36:57.861 --> 00:37:00.957
and real rewards for correct answers.

00:37:00.957 --> 00:37:03.787
And what happened was that Clever Hans

00:37:03.787 --> 00:37:07.118
inadvertently focused on the wrong cue.

00:37:07.118 --> 00:37:09.801
He found this cue of
people's social reactions

00:37:09.801 --> 00:37:12.912
that could reliably help
him solve the problem,

00:37:12.912 --> 00:37:15.231
but then it didn't
generalize to a test set

00:37:15.231 --> 00:37:18.060
where you intentionally
took that cue away.

00:37:18.060 --> 00:37:21.177
It did generalize to a
naturally occurring test set,

00:37:21.177 --> 00:37:22.958
where he had an audience.

00:37:22.958 --> 00:37:24.633
So that's more or less what's happening

00:37:24.633 --> 00:37:26.289
with machine learning algorithms.

00:37:26.289 --> 00:37:28.305
They've found these very linear patterns

00:37:28.305 --> 00:37:30.590
that can fit the training data,

00:37:30.590 --> 00:37:34.384
and these linear patterns even
generalize to the test data.

00:37:34.384 --> 00:37:36.907
They've learned to handle
any example that comes from

00:37:36.907 --> 00:37:40.415
the same distribution
as their training data.

00:37:40.415 --> 00:37:42.163
But then if you shift the distribution

00:37:42.163 --> 00:37:43.603
that you test them on,

00:37:43.603 --> 00:37:46.934
if a malicious adversary
actually creates examples

00:37:46.934 --> 00:37:48.570
that are intended to fool them,

00:37:48.570 --> 00:37:50.820
they're very easily fooled.

00:37:51.686 --> 00:37:54.316
In fact we find that modern
machine learning algorithms

00:37:54.316 --> 00:37:56.726
are wrong almost everywhere.

00:37:56.726 --> 00:37:59.606
We tend to think of them as
being correct most of the time,

00:37:59.606 --> 00:38:02.073
because when we run them on
naturally occurring inputs

00:38:02.073 --> 00:38:06.048
they achieve very high
accuracy percentages.

00:38:06.048 --> 00:38:08.440
But if we look, instead
of at the percentage

00:38:08.440 --> 00:38:11.107
of samples from an IID test set,

00:38:12.007 --> 00:38:15.628
if we look at the percentage
of the space in RN

00:38:15.628 --> 00:38:17.655
that is correctly classified,

00:38:17.655 --> 00:38:20.649
we find that they
misclassify almost everything

00:38:20.649 --> 00:38:24.158
and they behave reasonably
only on a very thin manifold

00:38:24.158 --> 00:38:27.489
surrounding the data
that we train them on.

00:38:27.489 --> 00:38:30.187
In this plot, I show you
several different examples

00:38:30.187 --> 00:38:32.006
of Gaussian noise

00:38:32.006 --> 00:38:35.075
that I've run through
a CIFAR-10 classifier.

00:38:35.075 --> 00:38:37.100
Everywhere that there is a pink box,

00:38:37.100 --> 00:38:39.213
the classifier thinks
that there is something

00:38:39.213 --> 00:38:40.780
rather than nothing.

00:38:40.780 --> 00:38:43.030
I'll come back to what
that means in a second.

00:38:43.030 --> 00:38:45.227
Everywhere that there is a yellow box,

00:38:45.227 --> 00:38:47.622
one step of the fast gradient sign method

00:38:47.622 --> 00:38:50.132
was able to persuade the
model that it was looking

00:38:50.132 --> 00:38:52.395
specifically at an airplane.

00:38:52.395 --> 00:38:53.731
I chose the airplane class

00:38:53.731 --> 00:38:56.254
because it was the one with
the lowest success rate.

00:38:56.254 --> 00:38:58.671
It had about a 25% success rate.

00:38:58.671 --> 00:39:01.898
That means an attacker
would need four chances

00:39:01.898 --> 00:39:06.291
to get noise recognized as
an airplane on this model.
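
The one-step fast gradient sign method mentioned here can be sketched in a few lines. This is a hedged illustration, not the lecture's exact code: it assumes the attacker has already computed the gradient of their objective with respect to the input pixels, and that pixels live in [0, 1].

```python
import numpy as np

def fgsm(x, grad_x, epsilon=0.1):
    """One step of the fast gradient sign method: move every input
    coordinate by +/- epsilon in the direction that increases the
    attacker's objective, then clip back to the valid pixel range."""
    x_adv = x + epsilon * np.sign(grad_x)
    return np.clip(x_adv, 0.0, 1.0)
```

Because only the sign of the gradient is used, the perturbation has max norm exactly epsilon, which is what keeps it visually small.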

00:39:06.291 --> 00:39:08.494
An interesting thing,
and appropriate enough

00:39:08.494 --> 00:39:09.994
given the story of Clever Hans,

00:39:09.994 --> 00:39:12.903
is that this model found
that about 70% of RN

00:39:12.903 --> 00:39:15.070
was classified as a horse.

00:39:17.510 --> 00:39:20.194
So I mentioned that this model will say

00:39:20.194 --> 00:39:22.606
that noise is something
rather than nothing.

00:39:22.606 --> 00:39:24.450
And it's actually kind of
important to think about

00:39:24.450 --> 00:39:26.401
how we evaluate that.

00:39:26.401 --> 00:39:28.498
If you have a softmax classifier,

00:39:28.498 --> 00:39:30.529
it has to give you a distribution

00:39:30.529 --> 00:39:34.158
over the n different classes
that you train it on.

00:39:34.158 --> 00:39:35.825
So there's a few ways that you can argue

00:39:35.825 --> 00:39:37.119
that the model is telling you

00:39:37.119 --> 00:39:39.138
that there's something
rather than nothing.

00:39:39.138 --> 00:39:42.026
One is you can say, if it
assigns something like 90%

00:39:42.026 --> 00:39:43.698
to one particular class,

00:39:43.698 --> 00:39:46.373
that seems to be voting
for that class being there.

00:39:46.373 --> 00:39:47.705
We'd much rather see it give us

00:39:47.705 --> 00:39:50.018
something like a uniform
distribution saying

00:39:50.018 --> 00:39:52.833
this noise doesn't look like
anything in the training set

00:39:52.833 --> 00:39:56.177
so it's equally likely
to be a horse or a car.

00:39:56.177 --> 00:39:58.075
And that's not what the model does.

00:39:58.075 --> 00:40:01.028
It'll say, this is very
definitely a horse.

00:40:01.028 --> 00:40:03.395
Another thing that you
can do is you can replace

00:40:03.395 --> 00:40:05.186
the last layer of the model.

00:40:05.186 --> 00:40:10.009
For example, you can use a
sigmoid output for each class.

00:40:10.009 --> 00:40:11.754
And then the model is actually
capable of telling you

00:40:11.754 --> 00:40:14.407
that any subset of classes is present.

00:40:14.407 --> 00:40:15.777
It could actually tell you that an image

00:40:15.777 --> 00:40:17.250
is both a horse and a car.

00:40:17.250 --> 00:40:19.292
And what we would like
it to do for the noise

00:40:19.292 --> 00:40:21.962
is tell us that none of
the classes is present,

00:40:21.962 --> 00:40:23.585
that all of the sigmoids
should have a value

00:40:23.585 --> 00:40:25.346
of less than 1/2.

00:40:25.346 --> 00:40:29.479
And 1/2 isn't even
particularly a low threshold.

00:40:29.479 --> 00:40:32.034
We could reasonably expect that
all of the sigmoids would be

00:40:32.034 --> 00:40:35.982
less than 0.01 for such a
defective input as this.

00:40:35.982 --> 00:40:38.226
But what we find instead
is that the sigmoids

00:40:38.226 --> 00:40:40.177
tend to have at least one class present

00:40:40.177 --> 00:40:42.122
just when we run Gaussian noise

00:40:42.122 --> 00:40:45.205
of sufficient norm through the model.
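
The per-class sigmoid check described above — rejecting an input as "nothing" only when every class's sigmoid falls below 1/2 — might look like the following sketch. The logits here stand in for whatever the model's last layer produces; the threshold is the 1/2 value from the talk, though as noted, something far stricter like 0.01 would be reasonable for noise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def detect_nothing(class_logits, threshold=0.5):
    """With one independent sigmoid output per class, an input can be
    rejected as 'none of the classes present' when every per-class
    probability falls below the threshold."""
    probs = sigmoid(np.asarray(class_logits, dtype=float))
    return bool(np.all(probs < threshold))
```

The failure mode in the lecture is precisely that Gaussian noise of sufficient norm still drives at least one of these sigmoids above the threshold.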

00:40:48.050 --> 00:40:50.269
We've also found that we
can do adversarial examples

00:40:50.269 --> 00:40:51.946
for reinforcement learning.

00:40:51.946 --> 00:40:53.329
And there's a video for this.

00:40:53.329 --> 00:40:54.946
I'll upload the slides after the talk

00:40:54.946 --> 00:40:56.202
and you can follow the link.

00:40:56.202 --> 00:40:58.082
Unfortunately I wasn't able
to get the WiFi to work

00:40:58.082 --> 00:41:00.245
so I can't show you the video animated.

00:41:00.245 --> 00:41:01.482
But I can describe
basically what's going on

00:41:01.482 --> 00:41:03.232
from this still here.

00:41:05.258 --> 00:41:08.149
There's a game Seaquest on Atari

00:41:08.149 --> 00:41:09.897
where you can train
reinforcement learning agents

00:41:09.897 --> 00:41:11.110
to play that game.

00:41:11.110 --> 00:41:14.270
And you can take the raw input pixels

00:41:14.270 --> 00:41:18.242
and you can take the
fast gradient sign method

00:41:18.242 --> 00:41:21.642
or other attacks that use other
norms besides the max norm,

00:41:21.642 --> 00:41:24.586
and compute perturbations
that are intended

00:41:24.586 --> 00:41:27.646
to change the action that
the policy would select.

00:41:27.646 --> 00:41:29.566
So the reinforcement learning policy,

00:41:29.566 --> 00:41:31.350
you can think of it as just
being like a classifier

00:41:31.350 --> 00:41:33.211
that looks at a frame.

00:41:33.211 --> 00:41:35.550
And instead of categorizing the input

00:41:35.550 --> 00:41:37.126
into a particular category,

00:41:37.126 --> 00:41:40.753
it gives you a softmax
distribution over actions to take.

00:41:40.753 --> 00:41:43.427
So if we just take that and
say that the most likely action

00:41:43.427 --> 00:41:47.482
should have its accuracy be
decreased by the adversary.

00:41:47.482 --> 00:41:49.261
Sorry, to have its probability

00:41:49.261 --> 00:41:51.034
be decreased by the adversary,

00:41:51.034 --> 00:41:53.030
you'll get these
perturbations of input frames

00:41:53.030 --> 00:41:55.762
that you can then apply
and cause the agent

00:41:55.762 --> 00:41:58.670
to play different actions
than it would have otherwise.

00:41:58.670 --> 00:42:00.268
And using this you can make the agent

00:42:00.268 --> 00:42:02.851
play Seaquest very, very badly.
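
Treating the policy like a classifier over actions, the attack on the agent can be sketched as follows. The gradient function is an assumed interface into the policy network (not from the lecture): it returns the gradient of the log-probability of the currently preferred action with respect to the input frame.

```python
import numpy as np

def attack_policy(frame, grad_logprob_of_argmax, epsilon=0.01):
    """Perturb an input frame to *decrease* the probability of the
    policy's most likely action, so the agent selects a different
    action than it otherwise would. `grad_logprob_of_argmax` is a
    hypothetical hook returning that log-probability's input gradient."""
    perturbation = -epsilon * np.sign(grad_logprob_of_argmax(frame))
    return np.clip(frame + perturbation, 0.0, 1.0)
```

Applying this to every frame is what makes the agent play Seaquest badly: each decision is nudged away from the action the unperturbed policy would have taken.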

00:42:03.786 --> 00:42:06.179
It's maybe not the most
interesting possible thing.

00:42:06.179 --> 00:42:07.767
What we'd really like is an environment

00:42:07.767 --> 00:42:09.993
where there are many different
reward functions available

00:42:09.993 --> 00:42:11.238
for us to study.

00:42:11.238 --> 00:42:14.071
So for example, if you had a robot

00:42:15.092 --> 00:42:17.579
that was intended to cook scrambled eggs,

00:42:17.579 --> 00:42:18.865
and you had a reward function measuring

00:42:18.865 --> 00:42:20.610
how well it's cooking scrambled eggs,

00:42:20.610 --> 00:42:22.397
and you had another reward function

00:42:22.397 --> 00:42:25.649
measuring how well it's
cooking chocolate cake,

00:42:25.649 --> 00:42:27.849
it would be really
interesting if we could make

00:42:27.849 --> 00:42:29.925
adversarial examples that cause the robot

00:42:29.925 --> 00:42:31.501
to make a chocolate cake

00:42:31.501 --> 00:42:35.017
when the user intended for
it to make scrambled eggs.

00:42:35.017 --> 00:42:37.581
That's because it's very
difficult to succeed at something

00:42:37.581 --> 00:42:40.393
and it's relatively straightforward
to make a system fail.

00:42:40.393 --> 00:42:42.400
So right now, adversarial examples for RL

00:42:42.400 --> 00:42:45.049
are very good at showing that
we can make RL agents fail.

00:42:45.049 --> 00:42:47.827
But we haven't yet been
able to hijack them

00:42:47.827 --> 00:42:49.229
and make them do a complicated task

00:42:49.229 --> 00:42:51.429
that's different from
what their owner intended.

00:42:51.429 --> 00:42:53.405
Seems like it's one of the next steps

00:42:53.405 --> 00:42:56.655
in adversarial example research though.

00:42:58.101 --> 00:43:01.078
If we look at high-dimensional
linear models,

00:43:01.078 --> 00:43:02.479
we can actually see that a lot of this

00:43:02.479 --> 00:43:04.682
is very simple and straightforward.

00:43:04.682 --> 00:43:07.585
Here we have a logistic regression model

00:43:07.585 --> 00:43:10.385
that classifies sevens and threes.

00:43:10.385 --> 00:43:13.665
So the whole model can be
described just by a weight vector

00:43:13.665 --> 00:43:16.807
and a single scalar bias term.

00:43:16.807 --> 00:43:20.404
We don't really need to see the
bias term for this exercise.

00:43:20.404 --> 00:43:22.063
If you look on the left
I've plotted the weights

00:43:22.063 --> 00:43:24.929
that we used to discriminate
sevens and threes.

00:43:24.929 --> 00:43:27.505
The weights should look a
little bit like the difference

00:43:27.505 --> 00:43:30.098
between the average seven
and the average three.

00:43:30.098 --> 00:43:31.505
And then down at the bottom we've taken

00:43:31.505 --> 00:43:33.225
the sign of the weights.

00:43:33.225 --> 00:43:35.764
So the gradient for a
logistic regression model

00:43:35.764 --> 00:43:38.529
is going to be proportional
to the weights.

00:43:38.529 --> 00:43:41.505
And then the sign of the weights gives you

00:43:41.505 --> 00:43:43.981
essentially the sign of the gradient.

00:43:43.981 --> 00:43:46.268
So we can do the fast gradient sign method

00:43:46.268 --> 00:43:49.955
to attack this model just
by looking at its weights.
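
Concretely: for logistic regression, the input-gradient of the loss is proportional to the weight vector, so the fast gradient sign attack reduces to adding or subtracting epsilon times sign(w), depending on the true label. A minimal sketch under those assumptions (binary labels 0/1, pixels in [0, 1]):

```python
import numpy as np

def attack_logreg(x, w, y, epsilon=0.25):
    """FGSM against logistic regression. The gradient of the loss
    with respect to x is (p - y) * w, so its sign is +sign(w) when
    y == 0 and -sign(w) when y == 1: we only need the weights."""
    direction = np.sign(w) if y == 0 else -np.sign(w)
    return np.clip(x + epsilon * direction, 0.0, 1.0)
```

This is why the lecture can attack the sevens-versus-threes model just by plotting sign(w): the same perturbation image works for every input of a given class.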

00:43:49.955 --> 00:43:52.619
In the examples in the panel

00:43:52.619 --> 00:43:54.327
that's the second column from the left

00:43:54.327 --> 00:43:55.981
we can see clean examples.

00:43:55.981 --> 00:43:58.302
And then on the right we've
just added or subtracted

00:43:58.302 --> 00:44:00.900
this image of the sign of
the weights off of them.

00:44:00.900 --> 00:44:03.515
To you and me as human observers,

00:44:03.515 --> 00:44:06.871
the sign of the weights
is just like garbage

00:44:06.871 --> 00:44:08.204
that's in the background,

00:44:08.204 --> 00:44:09.743
and we more or less filter it out.

00:44:09.743 --> 00:44:11.868
It doesn't look particularly
interesting to us.

00:44:11.868 --> 00:44:14.364
It doesn't grab our attention.

00:44:14.364 --> 00:44:16.001
To the logistic regression model

00:44:16.001 --> 00:44:17.607
this image of the sign of the weights

00:44:17.607 --> 00:44:20.449
is the most salient thing

00:44:20.449 --> 00:44:22.791
that could ever appear in the image.

00:44:22.791 --> 00:44:24.567
When it's positive it looks like

00:44:24.567 --> 00:44:26.748
the world's most quintessential seven.

00:44:26.748 --> 00:44:27.959
When it's negative it looks like

00:44:27.959 --> 00:44:29.684
the world's most quintessential three.

00:44:29.684 --> 00:44:31.127
And so the model makes its decision

00:44:31.127 --> 00:44:33.242
almost entirely based on this perturbation

00:44:33.242 --> 00:44:37.409
we added to the image, rather
than on the background.

00:44:38.498 --> 00:44:40.007
You could also take this same procedure,

00:44:40.007 --> 00:44:44.174
and my colleague Andrej at
OpenAI showed how you can

00:44:45.271 --> 00:44:49.063
modify the image on ImageNet
using this same approach,

00:44:49.063 --> 00:44:51.706
and turn this goldfish into a daisy.

00:44:51.706 --> 00:44:53.831
Because ImageNet is
much higher dimensional,

00:44:53.831 --> 00:44:56.769
you don't need to use quite
as large of a coefficient

00:44:56.769 --> 00:44:58.761
on the image of the weights.

00:44:58.761 --> 00:45:03.226
So we can make a more
persuasive fooling attack.

00:45:03.226 --> 00:45:05.249
You can see that this
same image of the weights,

00:45:05.249 --> 00:45:08.631
when applied to any different input image,

00:45:08.631 --> 00:45:12.231
will actually reliably
cause a misclassification.

00:45:12.231 --> 00:45:14.951
What's going on is that there
are many different classes,

00:45:14.951 --> 00:45:18.822
and it means that if
you choose the weights

00:45:18.822 --> 00:45:20.504
for any particular class,

00:45:20.504 --> 00:45:23.364
it's very unlikely that a new test image

00:45:23.364 --> 00:45:25.642
will belong to that class.

00:45:25.642 --> 00:45:27.349
So on ImageNet, if we're using

00:45:27.349 --> 00:45:29.351
the weights for the daisy class,

00:45:29.351 --> 00:45:31.431
and there are 1,000 different classes,

00:45:31.431 --> 00:45:33.628
then we have about a 99.9% chance

00:45:33.628 --> 00:45:36.122
that a test image will not be a daisy.

00:45:36.122 --> 00:45:37.767
If we then go ahead and add the weights

00:45:37.767 --> 00:45:39.809
for the daisy class to that image,

00:45:39.809 --> 00:45:41.889
then we get a daisy,
and because that's not

00:45:41.889 --> 00:45:45.207
the correct class, it's
a misclassification.

00:45:45.207 --> 00:45:47.068
So there's a paper at CVPR this year

00:45:47.068 --> 00:45:48.748
called Universal Adversarial Perturbations

00:45:48.748 --> 00:45:51.287
that expands a lot more
on this observation

00:45:51.287 --> 00:45:53.799
that we had going back in 2014.

00:45:53.799 --> 00:45:56.647
But basically these weight vectors,

00:45:56.647 --> 00:45:59.031
when applied to many different images,

00:45:59.031 --> 00:46:02.614
can cause misclassification
in all of them.

00:46:04.647 --> 00:46:06.303
I've spent a lot of time telling you

00:46:06.303 --> 00:46:08.508
that these linear models
are just terrible,

00:46:08.508 --> 00:46:11.269
and at some point you've
probably been hoping

00:46:11.269 --> 00:46:13.089
I would give you some sort
of a control experiment

00:46:13.089 --> 00:46:15.468
to convince you that there's another model

00:46:15.468 --> 00:46:16.988
that's not terrible.

00:46:16.988 --> 00:46:19.351
So it turns out that some quadratic models

00:46:19.351 --> 00:46:21.249
actually perform really well.

00:46:21.249 --> 00:46:23.927
In particular a shallow RBF network

00:46:23.927 --> 00:46:27.687
is able to resist adversarial
perturbations very well.

00:46:27.687 --> 00:46:29.047
Earlier I showed you an animation

00:46:29.047 --> 00:46:30.522
where I took a nine and I turned it into

00:46:30.522 --> 00:46:32.108
a zero, one, two, and so on,

00:46:32.108 --> 00:46:34.884
without really changing
its appearance at all.

00:46:34.884 --> 00:46:36.028
And I was able to fool

00:46:36.028 --> 00:46:39.329
a linear softmax regression classifier.

00:46:39.329 --> 00:46:40.947
Here I've got an RBF network

00:46:40.947 --> 00:46:43.384
where it outputs a separate probability

00:46:43.384 --> 00:46:45.388
of each class being absent or present,

00:46:45.388 --> 00:46:49.555
and that probability is given
by e to the negative square

00:46:51.111 --> 00:46:53.271
of the difference between a template image

00:46:53.271 --> 00:46:55.489
and the input image.
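
That output can be written down directly. This sketch assumes a unit bandwidth (the lecture's spoken formula omits any scale factor, so the `beta` parameter here is an added assumption):

```python
import numpy as np

def rbf_class_prob(x, template, beta=1.0):
    """Shallow RBF unit: the probability of a class being present is
    exp(-beta * squared distance) between the input and a per-class
    template. Far from every template, all probabilities — and hence
    all gradients — collapse toward zero, which is what makes deep
    RBF networks hard to train."""
    return float(np.exp(-beta * np.sum((np.asarray(x) - np.asarray(template)) ** 2)))
```

Note the probability is 1 only when the input exactly matches the template, which is why this model is literally a template matcher.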

00:46:55.489 --> 00:46:59.108
And if we actually follow the
gradient of this classifier,

00:46:59.108 --> 00:47:01.903
it does actually turn the image into

00:47:01.903 --> 00:47:04.801
a zero, a one, a two, a three, and so on,

00:47:04.801 --> 00:47:07.249
and we can actually
recognize those changes.

00:47:07.249 --> 00:47:09.649
The problem is, this
classifier does not get

00:47:09.649 --> 00:47:12.164
very good accuracy on the training set.

00:47:12.164 --> 00:47:13.767
It's a shallow model.

00:47:13.767 --> 00:47:15.503
It's basically just a template matcher.

00:47:15.503 --> 00:47:17.511
It is literally a template matcher.

00:47:17.511 --> 00:47:20.689
And if you try to make
it more sophisticated

00:47:20.689 --> 00:47:22.049
by making it deeper,

00:47:22.049 --> 00:47:26.216
it turns out that the gradient
of these RBF units is zero,

00:47:27.648 --> 00:47:30.762
or very near zero, throughout most of RN.

00:47:30.762 --> 00:47:32.769
So they're extremely difficult to train,

00:47:32.769 --> 00:47:36.289
even with batch normalization
and methods like that.

00:47:36.289 --> 00:47:39.727
I haven't managed to train
a deep RBF network yet.

00:47:39.727 --> 00:47:42.748
But I think if somebody comes
up with better hyperparameters

00:47:42.748 --> 00:47:46.102
or a new, more powerful
optimization algorithm,

00:47:46.102 --> 00:47:47.489
it might be possible to solve

00:47:47.489 --> 00:47:49.344
the adversarial example problem

00:47:49.344 --> 00:47:51.489
by training a deep RBF network

00:47:51.489 --> 00:47:55.985
where the model is so nonlinear
and has such wide flat areas

00:47:55.985 --> 00:47:59.409
that the adversary is not
able to push the cost uphill

00:47:59.409 --> 00:48:03.576
just by making small changes
to the model's input.

00:48:05.242 --> 00:48:06.887
One of the things that's the most alarming

00:48:06.887 --> 00:48:08.209
about adversarial examples

00:48:08.209 --> 00:48:11.649
is that they generalize
from one dataset to another

00:48:11.649 --> 00:48:13.468
and one model to another.

00:48:13.468 --> 00:48:15.329
Here I've trained two different models

00:48:15.329 --> 00:48:17.478
on two different training sets.

00:48:17.478 --> 00:48:20.145
The training sets are tiny in both cases.

00:48:20.145 --> 00:48:23.425
It's just MNIST three
versus seven classification,

00:48:23.425 --> 00:48:26.696
and this is really just for
the purpose of making a slide.

00:48:26.696 --> 00:48:29.207
If you train a logistic regression model

00:48:29.207 --> 00:48:32.644
on the digits shown in the left panel,

00:48:32.644 --> 00:48:35.903
you get the weights shown on
the left in the lower panel.

00:48:35.903 --> 00:48:37.585
If you train a logistic regression model

00:48:37.585 --> 00:48:39.729
on the digits shown in the upper right,

00:48:39.729 --> 00:48:42.564
you get the weights shown on
the right in the lower panel.

00:48:42.564 --> 00:48:44.225
So you've got two different training sets

00:48:44.225 --> 00:48:45.619
and we learn weight vectors that look

00:48:45.619 --> 00:48:47.143
very similar to each other.

00:48:47.143 --> 00:48:50.080
That's just because machine
learning algorithms generalize.

00:48:50.080 --> 00:48:51.884
You want them to learn a function that's

00:48:51.884 --> 00:48:54.740
somewhat independent of the
data that you train them on.

00:48:54.740 --> 00:48:55.879
It shouldn't matter which particular

00:48:55.879 --> 00:48:57.884
training examples you choose.

00:48:57.884 --> 00:48:58.924
If you want to generalize

00:48:58.924 --> 00:49:00.545
from the training set to the test set,

00:49:00.545 --> 00:49:02.781
you've also got to expect
that different training sets

00:49:02.781 --> 00:49:05.002
will give you more or
less the same result.

00:49:05.002 --> 00:49:06.583
And that means that
because they've learned

00:49:06.583 --> 00:49:08.340
more or less similar functions,

00:49:08.340 --> 00:49:13.237
they're vulnerable to
similar adversarial examples.

00:49:13.237 --> 00:49:15.723
An adversary can compute
an image that fools one

00:49:15.723 --> 00:49:18.461
and use it to fool the other.

00:49:18.461 --> 00:49:20.738
In fact we can actually
go ahead and measure

00:49:20.738 --> 00:49:22.386
the transfer rate between

00:49:22.386 --> 00:49:24.684
several different machine
learning techniques,

00:49:24.684 --> 00:49:27.154
not just different data sets.

00:49:27.154 --> 00:49:28.881
Nicolas Papernot and his collaborators

00:49:28.881 --> 00:49:30.799
have spent a lot of time exploring

00:49:30.799 --> 00:49:32.718
this transferability effect.

00:49:32.718 --> 00:49:35.965
And they found that for example,

00:49:35.965 --> 00:49:38.200
logistic regression makes
adversarial examples

00:49:38.200 --> 00:49:42.367
that transfer to decision
trees with 87.4% probability.

00:49:43.999 --> 00:49:48.058
Wherever you see dark
squares in this matrix,

00:49:48.058 --> 00:49:50.823
that shows that there's a
high amount of transfer.

00:49:50.823 --> 00:49:53.225
That means that it's very
possible for an attacker

00:49:53.225 --> 00:49:55.475
using the model on the left

00:49:56.380 --> 00:50:00.547
to create adversarial examples
for the model on the right.

00:50:01.578 --> 00:50:03.324
The procedure overall is that,

00:50:03.324 --> 00:50:05.100
suppose the attacker wants to fool a model

00:50:05.100 --> 00:50:07.863
that they don't actually have access to.

00:50:07.863 --> 00:50:10.364
They don't know the
architecture that's used

00:50:10.364 --> 00:50:11.783
to train the model.

00:50:11.783 --> 00:50:13.770
They may not even know which
algorithm is being used.

00:50:13.770 --> 00:50:15.198
They may not know
whether they're attacking

00:50:15.198 --> 00:50:17.260
a decision tree or a deep neural net.

00:50:17.260 --> 00:50:20.540
And they also don't know the parameters

00:50:20.540 --> 00:50:23.303
of the model that they're going to attack.

00:50:23.303 --> 00:50:26.089
So what they can do is
train their own model

00:50:26.089 --> 00:50:29.172
that they'll use to build the attack.

00:50:30.272 --> 00:50:32.175
There's two different ways
you can train your own model.

00:50:32.175 --> 00:50:33.703
One is you can label your own training set

00:50:33.703 --> 00:50:36.620
for the same task that you want to attack.

00:50:36.620 --> 00:50:39.802
Say that somebody is using
an ImageNet classifier,

00:50:39.802 --> 00:50:42.924
and for whatever reason you
don't have access to ImageNet,

00:50:42.924 --> 00:50:44.797
you can take your own
photos and label them,

00:50:44.797 --> 00:50:46.939
train your own object recognizer.

00:50:46.939 --> 00:50:48.620
It's going to share adversarial examples

00:50:48.620 --> 00:50:50.700
with an ImageNet model.

00:50:50.700 --> 00:50:52.384
The other thing you can do is,

00:50:52.384 --> 00:50:55.361
say that you can't afford to
gather your own training set.

00:50:55.361 --> 00:50:57.420
What you can do instead is if you can get

00:50:57.420 --> 00:50:59.041
limited access to the model

00:50:59.041 --> 00:51:02.236
where you just have the ability
to send inputs to the model

00:51:02.236 --> 00:51:03.804
and observe its outputs,

00:51:03.804 --> 00:51:06.700
then you can send those
inputs, observe the outputs,

00:51:06.700 --> 00:51:09.361
and use those as your training set.

00:51:09.361 --> 00:51:11.201
This'll work even if the output

00:51:11.201 --> 00:51:12.740
that you get from the target model

00:51:12.740 --> 00:51:15.943
is only the class label that it chooses.

00:51:15.943 --> 00:51:17.882
A lot of people read this and assume that

00:51:17.882 --> 00:51:19.004
you need to have access

00:51:19.004 --> 00:51:21.244
to all the probability values it outputs.

00:51:21.244 --> 00:51:24.975
But even just the class
labels are sufficient.
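
The label-only data-collection step of that substitute-model attack can be sketched as below. Everything here is an assumed interface: `query_target` stands in for whatever limited API access the attacker has, returning only the chosen class label.

```python
def collect_substitute_data(query_target, probe_inputs):
    """Black-box data collection: send probe inputs to the target
    model, record only the class label it returns for each, and use
    the resulting (input, label) pairs as a training set for a local
    substitute model. No probabilities or gradients are needed."""
    labels = [query_target(x) for x in probe_inputs]
    return list(probe_inputs), labels
```

The attacker then trains any differentiable model on these pairs and runs a white-box attack like FGSM against the substitute, relying on transferability to fool the target.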

00:51:24.975 --> 00:51:26.684
So once you've used one
of these two methods,

00:51:26.684 --> 00:51:28.204
either gathering your own training set

00:51:28.204 --> 00:51:31.324
or observing the outputs
of a target model,

00:51:31.324 --> 00:51:32.877
you can train your own model

00:51:32.877 --> 00:51:36.444
and then make adversarial
examples for your model.

00:51:36.444 --> 00:51:38.823
Those adversarial examples
are very likely to transfer

00:51:38.823 --> 00:51:41.178
and affect the target model.

00:51:41.178 --> 00:51:43.736
So you can then go and
send those out and fool it,

00:51:43.736 --> 00:51:47.569
even if you didn't have
access to it directly.

00:51:48.513 --> 00:51:50.503
We've also measured the transferability

00:51:50.503 --> 00:51:52.360
across different data sets,

00:51:52.360 --> 00:51:54.583
and for most models we find that they're

00:51:54.583 --> 00:51:56.204
kind of in an intermediate zone

00:51:56.204 --> 00:51:58.103
where different data sets will result

00:51:58.103 --> 00:52:01.476
in a transfer rate of, like, 60% to 80%.

00:52:01.476 --> 00:52:04.001
There's a few models like SVMs
that are very data dependent

00:52:04.001 --> 00:52:08.103
because SVMs end up focusing
on a very small subset

00:52:08.103 --> 00:52:10.941
of the training data to form
their final decision boundary.

00:52:10.941 --> 00:52:12.744
But most models that we care about

00:52:12.744 --> 00:52:15.994
are somewhere in the intermediate zone.

00:52:17.444 --> 00:52:19.554
Now that's just assuming that you rely

00:52:19.554 --> 00:52:22.596
on the transfer happening naturally.

00:52:22.596 --> 00:52:23.879
You make an adversarial example

00:52:23.879 --> 00:52:26.740
and you hope that it will
transfer to your target.

00:52:26.740 --> 00:52:30.353
What if you do something to
stack the deck in your favor

00:52:30.353 --> 00:52:33.211
and improve the odds that you'll get

00:52:33.211 --> 00:52:35.860
your adversarial examples to transfer?

00:52:35.860 --> 00:52:38.937
Dawn Song's group at UC
Berkeley studied this.

00:52:38.937 --> 00:52:43.060
They found that if they take
an ensemble of different models

00:52:43.060 --> 00:52:46.078
and they use gradient
descent to search for

00:52:46.078 --> 00:52:47.998
an adversarial example that will fool

00:52:47.998 --> 00:52:50.297
every member of their ensemble,

00:52:50.297 --> 00:52:53.337
then it's extremely likely
that it will transfer

00:52:53.337 --> 00:52:56.958
and fool a new machine learning model.
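
The ensemble attack amounts to iterated gradient steps against the summed loss of all members, projected back into a small max-norm ball. A hedged sketch, where `member_grads` is a list of functions (an assumed interface) each returning one member's loss gradient at the current point:

```python
import numpy as np

def ensemble_fgsm(x, member_grads, epsilon=0.1, steps=10, alpha=0.02):
    """Iterated sign-gradient attack against an ensemble: each step
    sums the input-gradients of every member's loss and moves in the
    sign of the total, then projects back onto the epsilon max-norm
    ball around the original input x."""
    x_adv = x.copy()
    for _ in range(steps):
        total_grad = sum(g(x_adv) for g in member_grads)
        x_adv = x_adv + alpha * np.sign(total_grad)
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
    return x_adv
```

An example that fools every member simultaneously has, empirically, a very high chance of also fooling a held-out model, which is the finding described above.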

00:52:56.958 --> 00:52:59.131
So if you have an ensemble of five models,

00:52:59.131 --> 00:53:00.315
you can get it to the point where

00:53:00.315 --> 00:53:02.596
there's essentially a 100% chance

00:53:02.596 --> 00:53:04.654
that you'll fool a sixth model

00:53:04.654 --> 00:53:07.249
out of the set of models
that they compared.

00:53:07.249 --> 00:53:09.881
They looked at things like
ResNets of different depths,

00:53:09.881 --> 00:53:11.464
VGG, and GoogLeNet.

00:53:12.752 --> 00:53:16.055
So in the labels for each
of the different rows

00:53:16.055 --> 00:53:18.201
you can see that they
made ensembles that lacked

00:53:18.201 --> 00:53:19.835
each of these different models,

00:53:19.835 --> 00:53:23.321
and then they would test it on
the different target models.

00:53:23.321 --> 00:53:28.137
So like if you make an
ensemble that omits GoogLeNet,

00:53:28.137 --> 00:53:32.076
you have only about a
5% chance of GoogLeNet

00:53:32.076 --> 00:53:34.521
correctly classifying
the adversarial example

00:53:34.521 --> 00:53:37.023
you make for that ensemble.

00:53:37.023 --> 00:53:40.507
If you make an ensemble
that omits ResNet-152,

00:53:40.507 --> 00:53:42.353
in their experiments they found that

00:53:42.353 --> 00:53:46.520
there was a 0% chance of
ResNet-152 resisting that attack.

00:53:48.531 --> 00:53:50.337
That probably indicates
they should have run

00:53:50.337 --> 00:53:52.004
some more adversarial examples

00:53:52.004 --> 00:53:54.697
until they found a non-zero success rate,

00:53:54.697 --> 00:53:57.969
but it does show that the
attack is very powerful.

00:53:57.969 --> 00:53:59.770
And then when you go on to

00:53:59.770 --> 00:54:01.713
intentionally cause the transfer effect,

00:54:01.713 --> 00:54:04.713
you can really make it quite strong.

00:54:05.872 --> 00:54:08.241
A lot of people often
ask me if the human brain

00:54:08.241 --> 00:54:10.808
is vulnerable to adversarial examples.

00:54:10.808 --> 00:54:14.436
And for this lecture I can't
use copyrighted material,

00:54:14.436 --> 00:54:17.360
but there's some really
hilarious things on the Internet

00:54:17.360 --> 00:54:19.693
if you go looking for, like,

00:54:21.329 --> 00:54:23.833
the fake CAPTCHA with
images of Mark Hamill,

00:54:23.833 --> 00:54:27.214
you'll find something
that my perception system

00:54:27.214 --> 00:54:29.015
definitely can't handle.

00:54:29.015 --> 00:54:31.708
So here's another one
that's actually published

00:54:31.708 --> 00:54:35.577
with a license where I was
confident I'm allowed to use it.

00:54:35.577 --> 00:54:38.473
You can look at this image
of different circles here,

00:54:38.473 --> 00:54:42.217
and they appear to be intertwined spirals.

00:54:42.217 --> 00:54:45.210
But in fact they are concentric circles.

00:54:45.210 --> 00:54:47.521
The orientation of the
edges of the squares

00:54:47.521 --> 00:54:51.177
is interfering with the edge
detectors in your brain,

00:54:51.177 --> 00:54:55.468
making it look like the
circles are spiraling.

00:54:55.468 --> 00:54:57.372
So you can think of
these optical illusions

00:54:57.372 --> 00:54:59.847
as being adversarial
examples in the human brain.

00:54:59.847 --> 00:55:01.908
What's interesting is that
we don't seem to share

00:55:01.908 --> 00:55:03.589
many adversarial examples in common

00:55:03.589 --> 00:55:05.732
with machine learning models.

00:55:05.732 --> 00:55:08.174
Adversarial examples
transfer extremely reliably

00:55:08.174 --> 00:55:09.970
between different machine learning models,

00:55:09.970 --> 00:55:11.956
especially if you use that ensemble trick

00:55:11.956 --> 00:55:15.492
that was developed at UC Berkeley.

00:55:15.492 --> 00:55:18.654
But those adversarial
examples don't fool us.

00:55:18.654 --> 00:55:20.212
It tells us that we must be using

00:55:20.212 --> 00:55:22.436
a very different algorithm or model family

00:55:22.436 --> 00:55:25.417
than current convolutional networks.

00:55:25.417 --> 00:55:27.273
We don't really know what
the difference is yet,

00:55:27.273 --> 00:55:30.023
but it would be very
interesting to figure that out.

00:55:30.023 --> 00:55:32.953
It seems to suggest that
studying adversarial examples

00:55:32.953 --> 00:55:35.353
could tell us how to significantly improve

00:55:35.353 --> 00:55:37.854
our existing machine learning models.

00:55:37.854 --> 00:55:40.413
Even if you don't care
about having an adversary,

00:55:40.413 --> 00:55:43.113
we might figure out
something or other about

00:55:43.113 --> 00:55:45.111
how to make machine learning algorithms

00:55:45.111 --> 00:55:48.116
deal with ambiguity and unexpected inputs

00:55:48.116 --> 00:55:50.033
more like a human does.

00:55:52.106 --> 00:55:55.594
If we actually want to go out
and do attacks in practice,

00:55:55.594 --> 00:56:00.276
there's starting to be a body
of research on this subject.

00:56:00.276 --> 00:56:03.060
Nicolas Papernot showed that he could use

00:56:03.060 --> 00:56:05.897
the transfer effect to fool classifiers

00:56:05.897 --> 00:56:09.177
hosted by MetaMind, Amazon, and Google.

00:56:09.177 --> 00:56:11.452
So these are all just
different machine learning APIs

00:56:11.452 --> 00:56:13.755
where you can upload a dataset

00:56:13.755 --> 00:56:16.275
and the API will train the model for you.

00:56:16.275 --> 00:56:19.038
And then you don't actually
know, in most cases,

00:56:19.038 --> 00:56:21.316
which model is trained for you.

00:56:21.316 --> 00:56:23.714
You don't have access to its
weights or anything like that.

00:56:23.714 --> 00:56:26.168
So Nicolas would train
his own copy of the model

00:56:26.168 --> 00:56:27.553
using the API,

00:56:27.553 --> 00:56:31.256
and then build a model on
his own personal desktop

00:56:31.256 --> 00:56:34.169
where he could fool the API-hosted model.
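
[Editor's note: a minimal sketch of that substitute-model attack. The real work attacked hosted image classifiers; here the "victim" is an invented linear model queried only for labels, and the query budget and epsilon are made up.]

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The victim: a hosted model we can only query for labels (weights unknown).
w_victim = rng.normal(size=20)
def victim_api(inputs):
    return (inputs @ w_victim > 0).astype(float)

# Step 1: label our own data by querying the API.
x_q = rng.normal(size=(500, 20))
y_q = victim_api(x_q)

# Step 2: train a local substitute (logistic regression) on those labels.
w_sub = np.zeros(20)
for _ in range(2000):
    p = sigmoid(x_q @ w_sub)
    w_sub -= 0.1 * x_q.T @ (p - y_q) / len(y_q)

# Step 3: attack the substitute with the fast gradient sign method
# and rely on the transfer effect to fool the hosted model.
x = rng.normal(size=(100, 20))
y = victim_api(x)
grad = (sigmoid(x @ w_sub) - y)[:, None] * w_sub[None, :]
x_adv = x + 0.5 * np.sign(grad)

transfer_rate = float(np.mean(victim_api(x_adv) != y))
```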

00:56:34.169 --> 00:56:36.917
Later, Berkeley showed you
could fool Clarifai in this way.

00:56:36.917 --> 00:56:37.750
Yeah?

00:56:37.750 --> 00:56:39.273
- [Man] What did you mean when you said

00:56:39.273 --> 00:56:41.222
machine-generated adversarial
examples don't generally fool us?

00:56:41.222 --> 00:56:43.054
Because I thought that
was part of the point

00:56:43.054 --> 00:56:46.724
that with machine-generated
adversarial examples,

00:56:46.724 --> 00:56:48.990
generally just a few pixels change.

00:56:48.990 --> 00:56:51.990
- Oh, so if we look at, for example,

00:56:53.623 --> 00:56:55.070
like this picture of the panda.

00:56:55.070 --> 00:56:56.497
To us it looks like a panda.

00:56:56.497 --> 00:56:59.837
To most machine learning
models it looks like a gibbon.

00:56:59.837 --> 00:57:02.830
And so this change isn't
interfering with our brains,

00:57:02.830 --> 00:57:04.963
but it reliably fools
lots of different

00:57:04.963 --> 00:57:06.963
machine learning models.

00:57:08.713 --> 00:57:12.836
I saw somebody actually took
this image of the perturbation

00:57:12.836 --> 00:57:15.433
out of our paper, and they pasted it

00:57:15.433 --> 00:57:17.396
on their Facebook profile picture

00:57:17.396 --> 00:57:20.551
to see if it could interfere
with Facebook recognizing them.

00:57:20.551 --> 00:57:22.713
And they said that it did.

00:57:22.713 --> 00:57:25.956
I don't think that Facebook
has a gibbon tag though,

00:57:25.956 --> 00:57:29.644
so we don't know if they managed to

00:57:29.644 --> 00:57:32.811
make it think that they were a gibbon.

00:57:34.138 --> 00:57:35.977
And one of the other
things that you can do

00:57:35.977 --> 00:57:39.161
that's of fairly high
practical significance

00:57:39.161 --> 00:57:42.238
is you can actually
fool malware detectors.

00:57:42.238 --> 00:57:44.201
Kathrin Grosse at
Saarland University

00:57:44.201 --> 00:57:45.657
wrote a paper about this.

00:57:45.657 --> 00:57:47.276
And there's starting to be a few others.

00:57:47.276 --> 00:57:50.201
There's a model called MalGAN
that actually uses a GAN

00:57:50.201 --> 00:57:54.815
to generate adversarial
examples for malware detectors.

00:57:54.815 --> 00:57:57.300
Another thing that matters
a lot if you are interested

00:57:57.300 --> 00:57:58.840
in using these attacks in the real world

00:57:58.840 --> 00:58:00.724
and defending against
them in the real world

00:58:00.724 --> 00:58:02.956
is that a lot of the
time you don't actually

00:58:02.956 --> 00:58:06.057
have access to the
digital input to a model.

00:58:06.057 --> 00:58:09.017
If you're interested in
the perception system

00:58:09.017 --> 00:58:11.300
for a self-driving car or a robot,

00:58:11.300 --> 00:58:14.116
you probably don't get to
actually write to the buffer

00:58:14.116 --> 00:58:15.737
on the robot itself.

00:58:15.737 --> 00:58:18.420
You just get to show the robot objects

00:58:18.420 --> 00:58:20.500
that it can see through a camera lens.

00:58:20.500 --> 00:58:24.445
So my colleagues Alexey
Kurakin and Samy Bengio and I

00:58:24.445 --> 00:58:27.806
wrote a paper where we studied
if we can actually fool

00:58:27.806 --> 00:58:30.313
an object recognition
system running on a phone,

00:58:30.313 --> 00:58:33.205
where it perceives the
world through a camera.

00:58:33.205 --> 00:58:35.345
Our methodology was
really straightforward.

00:58:35.345 --> 00:58:36.894
We just printed out several pictures

00:58:36.894 --> 00:58:38.654
of adversarial examples.

00:58:38.654 --> 00:58:41.988
And we found that the
object recognition system

00:58:41.988 --> 00:58:44.430
run by the camera was fooled by them.

00:58:44.430 --> 00:58:46.489
The system on the camera
is actually different

00:58:46.489 --> 00:58:47.886
from the model that we used

00:58:47.886 --> 00:58:49.550
to generate the adversarial examples.

00:58:49.550 --> 00:58:53.379
So we're showing not just transfer across

00:58:53.379 --> 00:58:55.826
the changes that happen
when you use the camera,

00:58:55.826 --> 00:58:58.009
we're also showing that
those transfer across

00:58:58.009 --> 00:59:00.022
the choice of model that you use.

00:59:00.022 --> 00:59:02.692
So the attacker could conceivably fool

00:59:02.692 --> 00:59:05.267
a system that's deployed
in a physical agent,

00:59:05.267 --> 00:59:07.950
even if they don't have access
to the model on that agent

00:59:07.950 --> 00:59:11.539
and even if they can't interface
directly with the agent

00:59:11.539 --> 00:59:13.372
but just subtly modify

00:59:15.566 --> 00:59:19.085
objects that it can
see in its environment.

00:59:19.085 --> 00:59:20.183
Yeah?

00:59:20.183 --> 00:59:22.434
- [Man] Why does the,

00:59:22.434 --> 00:59:24.408
why does the low-quality camera's image noise

00:59:24.408 --> 00:59:26.586
not affect the adversarial example?

00:59:26.586 --> 00:59:28.311
Because that's what one would expect.

00:59:28.311 --> 00:59:30.023
- Yeah, so I think a lot of that

00:59:30.023 --> 00:59:34.071
comes back to the maps
that I showed earlier.

00:59:34.071 --> 00:59:36.614
If you cross over the
boundary into the realm

00:59:36.614 --> 00:59:38.426
of adversarial examples,

00:59:38.426 --> 00:59:40.846
they occupy a pretty wide space

00:59:40.846 --> 00:59:43.348
and they're very densely packed in there.

00:59:43.348 --> 00:59:45.108
So if you jostle around a little bit,

00:59:45.108 --> 00:59:48.590
you're not going to recover
from the adversarial attack.

00:59:48.590 --> 00:59:50.628
If the camera noise, somehow or other,

00:59:50.628 --> 00:59:53.966
was aligned with the negative
gradient of the cost,

00:59:53.966 --> 00:59:57.383
then the camera could take a
gradient descent step downhill

00:59:57.383 --> 01:00:01.407
and rescue you from the uphill
step that the adversary took.

01:00:01.407 --> 01:00:03.252
But probably the camera's
taking more or less

01:00:03.252 --> 01:00:06.699
something that you could
model as a random direction.

01:00:06.699 --> 01:00:09.324
Like clearly when you use
the camera more than once

01:00:09.324 --> 01:00:11.902
it's going to do the same thing each time,

01:00:11.902 --> 01:00:15.129
but from the point of
view of how that direction

01:00:15.129 --> 01:00:18.868
relates to the image
classification problem,

01:00:18.868 --> 01:00:22.281
it's more or less a random
variable that you sample once.

01:00:22.281 --> 01:00:25.025
And it seems unlikely to align exactly

01:00:25.025 --> 01:00:28.275
with the normal to this class boundary.
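
[Editor's note: a toy sketch of that answer. The linear "logit" and image size are invented; it shows why a random direction, like camera noise, barely moves a high-dimensional linear model's output while the sign-gradient direction moves it enormously.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3072                     # e.g. a 32x32x3 image, flattened (assumed size)
w = rng.normal(size=n)       # weights of a linear logit, standing in for the model

eps = 0.01
adversarial = eps * np.sign(w)                    # worst-case direction
camera_noise = eps * rng.choice([-1.0, 1.0], n)   # random direction of equal size

adv_effect = float(adversarial @ w)     # grows like eps * n (the L1 norm of w)
noise_effect = float(camera_noise @ w)  # only ~ eps * sqrt(n), random sign
```

The adversarial step aligns with every coordinate of the gradient at once, while a random step's coordinate-wise contributions mostly cancel, which is why noise neither creates nor undoes an adversarial example.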

01:00:33.238 --> 01:00:36.762
There's a lot of different
defenses that we'd like to build.

01:00:36.762 --> 01:00:39.425
And it's a little bit disappointing

01:00:39.425 --> 01:00:41.265
that I'm mostly here to
tell you about attacks.

01:00:41.265 --> 01:00:44.088
I'd like to tell you how to
make your systems more robust.

01:00:44.088 --> 01:00:47.332
But basically every defense we've tried

01:00:47.332 --> 01:00:49.192
has failed pretty badly.

01:00:49.192 --> 01:00:52.329
And in fact, even when
people have published

01:00:52.329 --> 01:00:54.996
that they successfully defended...

01:00:55.927 --> 01:00:57.833
Well, there's been several papers on arXiv

01:00:57.833 --> 01:00:59.892
over the last several months.

01:00:59.892 --> 01:01:02.873
Nicholas Carlini at Berkeley
just released a paper

01:01:02.873 --> 01:01:07.710
where he shows that 10 of
those defenses are broken.

01:01:07.710 --> 01:01:09.870
So this is a really, really hard problem.

01:01:09.870 --> 01:01:11.849
You can't just make it go away by using

01:01:11.849 --> 01:01:15.630
traditional regularization techniques.

01:01:15.630 --> 01:01:18.328
In particular, generative
models are not enough

01:01:18.328 --> 01:01:19.649
to solve the problem.

01:01:19.649 --> 01:01:21.366
A lot of people say, "Oh the
problem that's going on here

01:01:21.366 --> 01:01:22.998
"is you don't know anything
about the distribution

01:01:22.998 --> 01:01:25.343
"over the input pixels.

01:01:25.343 --> 01:01:26.577
"If you could just tell

01:01:26.577 --> 01:01:28.164
"whether the input is realistic or not

01:01:28.164 --> 01:01:31.141
"then you'd be able to resist it."

01:01:31.141 --> 01:01:33.469
It turns out that what's going on here is

01:01:33.469 --> 01:01:36.284
what matters more than getting
the right distributions

01:01:36.284 --> 01:01:37.566
over the inputs x,

01:01:37.566 --> 01:01:39.305
is getting the right
posterior distribution

01:01:39.305 --> 01:01:42.366
over the class of labels y given inputs x.

01:01:42.366 --> 01:01:44.665
So just using a generative model

01:01:44.665 --> 01:01:46.905
is not enough to solve the problem.

01:01:46.905 --> 01:01:49.095
I think a very carefully
designed generative model

01:01:49.095 --> 01:01:51.070
could possibly do it.

01:01:51.070 --> 01:01:54.729
Here I show two different modes
of a bimodal distribution,

01:01:54.729 --> 01:01:56.446
and we have two different
generative models

01:01:56.446 --> 01:01:58.948
that try to capture these modes.

01:01:58.948 --> 01:02:01.348
On the left we have a
mixture of two Gaussians.

01:02:01.348 --> 01:02:04.148
On the right we have a
mixture of two Laplacians.

01:02:04.148 --> 01:02:06.395
You can not really tell
the difference visually

01:02:06.395 --> 01:02:09.506
between the distribution
they impose over x,

01:02:09.506 --> 01:02:11.601
and the difference in the
likelihood they assign

01:02:11.601 --> 01:02:13.929
to the training data is negligible.

01:02:13.929 --> 01:02:16.158
But the posterior distribution
they assign over classes

01:02:16.158 --> 01:02:17.886
is extremely different.

01:02:17.886 --> 01:02:20.488
On the left we get a logistic
regression classifier

01:02:20.488 --> 01:02:22.833
that has very high confidence

01:02:22.833 --> 01:02:25.143
out in the tails of the distribution

01:02:25.143 --> 01:02:27.049
where there is never any training data.

01:02:27.049 --> 01:02:29.108
On the right, with the
Laplacian distribution,

01:02:29.108 --> 01:02:32.025
we level off to more or less 50-50.
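
[Editor's note: the contrast described here can be computed directly. The means and scales below (plus or minus 1, unit width) are invented to make it concrete: the Gaussian mixture's posterior saturates to extreme confidence in the tails, while the Laplacian mixture's posterior levels off.]

```python
import numpy as np

def gaussian_posterior(x, mu0=-1.0, mu1=1.0, sigma=1.0):
    """p(y=1 | x) for an equal-weight mixture of two Gaussians."""
    log_ratio = ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
    return 1.0 / (1.0 + np.exp(-log_ratio))

def laplace_posterior(x, mu0=-1.0, mu1=1.0, b=1.0):
    """p(y=1 | x) for an equal-weight mixture of two Laplacians."""
    log_ratio = (np.abs(x - mu0) - np.abs(x - mu1)) / b
    return 1.0 / (1.0 + np.exp(-log_ratio))

# Far out in the tails, where there was never any training data:
far = 50.0
g = gaussian_posterior(far)  # saturates to 1.0: extreme confidence
l = laplace_posterior(far)   # levels off: |x-mu0| - |x-mu1| -> mu1 - mu0 = 2
```

With equal variances, the Gaussian mixture's log-odds grow linearly in x without bound, while the Laplacian mixture's log-odds are capped at (mu1 - mu0) / b once x passes both means.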

01:02:33.156 --> 01:02:33.989
Yeah?

01:02:33.989 --> 01:02:37.156
[speaker drowned out]

01:02:44.052 --> 01:02:46.666
The issue is that it's a
nonstationary distribution.

01:02:46.666 --> 01:02:48.052
So if you train it to recognize

01:02:48.052 --> 01:02:49.834
one kind of adversarial example,

01:02:49.834 --> 01:02:52.170
then it will become
vulnerable to another kind

01:02:52.170 --> 01:02:55.871
that's designed to fool its detector.

01:02:55.871 --> 01:02:59.631
That's one of the categories of
defenses that Nicholas broke

01:02:59.631 --> 01:03:02.631
in his latest paper that he put out.

01:03:04.667 --> 01:03:07.231
So here basically the choice of exactly

01:03:07.231 --> 01:03:09.370
the family of generative
model has a big effect

01:03:09.370 --> 01:03:13.537
on whether the posterior becomes
deterministic or uniform,

01:03:14.765 --> 01:03:17.348
as the model extrapolates.

01:03:17.348 --> 01:03:21.212
And if we could design a really
rich, deep generative model

01:03:21.212 --> 01:03:24.387
that can generate
realistic ImageNet images

01:03:24.387 --> 01:03:28.012
and also correctly calculate
its posterior distribution,

01:03:28.012 --> 01:03:31.389
then maybe something like
this approach could work.

01:03:31.389 --> 01:03:33.072
But at the moment it's
really difficult to get

01:03:33.072 --> 01:03:36.029
any of those probabilistic
calculations correct.

01:03:36.029 --> 01:03:38.273
And what usually happens is,

01:03:38.273 --> 01:03:40.012
somewhere or other we
make an approximation

01:03:40.012 --> 01:03:42.156
that causes the posterior distribution

01:03:42.156 --> 01:03:45.553
to extrapolate very linearly again.

01:03:45.553 --> 01:03:48.476
It's been a difficult
engineering challenge

01:03:48.476 --> 01:03:50.135
to build generative models

01:03:50.135 --> 01:03:54.302
that actually capture these
distributions accurately.

01:03:55.772 --> 01:03:58.681
The universal approximation
theorem tells us that

01:03:58.681 --> 01:04:00.273
whatever shape we would like

01:04:00.273 --> 01:04:02.850
our classification function to have,

01:04:02.850 --> 01:04:04.375
a neural net that's big enough

01:04:04.375 --> 01:04:06.407
ought to be able to represent it.

01:04:06.407 --> 01:04:08.505
It's an open question whether
we can train the neural net

01:04:08.505 --> 01:04:09.750
to have that function,

01:04:09.750 --> 01:04:11.622
but we know that we should be able to

01:04:11.622 --> 01:04:13.340
at least represent the right shape.

01:04:13.340 --> 01:04:15.188
So so far we've been getting neural nets

01:04:15.188 --> 01:04:18.369
that give us these very
linear decision functions,

01:04:18.369 --> 01:04:19.569
and we'd like to get something

01:04:19.569 --> 01:04:21.743
that looks a little bit
more like a step function.

01:04:21.743 --> 01:04:25.111
So what if we actually just
train on adversarial examples?

01:04:25.111 --> 01:04:27.545
For every input x in the training set,

01:04:27.545 --> 01:04:31.727
we also train x plus an
adversarial perturbation to map

01:04:31.727 --> 01:04:34.252
to the same class label as the original.

01:04:34.252 --> 01:04:37.187
It turns out that this sort of works.

01:04:37.187 --> 01:04:39.111
You can generally resist

01:04:39.111 --> 01:04:41.388
the same kind of attack that you train on.
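
[Editor's note: a toy sketch of that training loop, using logistic regression and the fast gradient sign method rather than a deep net; the data, epsilon, and learning rate are all invented.]

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast gradient sign method: for logistic regression, the
    input-gradient of the cross-entropy loss is (p - y) * w."""
    p = sigmoid(x @ w + b)
    return x + eps * np.sign((p - y)[:, None] * w[None, :])

def train(x, y, eps=0.0, steps=500, lr=0.1):
    """If eps > 0, each step also trains on FGSM versions of the batch,
    mapped to the same labels as the originals (adversarial training)."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        xs = x if eps == 0 else np.vstack([x, fgsm(x, y, w, b, eps)])
        ys = y if eps == 0 else np.concatenate([y, y])
        p = sigmoid(xs @ w + b)
        w -= lr * xs.T @ (p - ys) / len(ys)
        b -= lr * float(np.mean(p - ys))
    return w, b

# Toy two-class data: class means at +0.5 and -0.5 in every dimension.
x = rng.normal(size=(200, 10)) + np.outer(np.repeat([1.0, -1.0], 100), np.ones(10)) * 0.5
y = np.repeat([1.0, 0.0], 100)

def adversarial_error(w, b, eps=0.3):
    xa = fgsm(x, y, w, b, eps)
    return float(np.mean((sigmoid(xa @ w + b) > 0.5) != (y > 0.5)))

w_clean, b_clean = train(x, y, eps=0.0)
w_adv, b_adv = train(x, y, eps=0.3)
```

As in the lecture, this only demonstrates resisting the same single-step attack you trained on; an iterative attack would still be much harder to resist.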

01:04:41.388 --> 01:04:43.786
And an important consideration

01:04:43.786 --> 01:04:46.151
is making sure that you could
run your attack very quickly

01:04:46.151 --> 01:04:48.508
so that you can train on lots of examples.

01:04:48.508 --> 01:04:51.089
So here the green curve at the very top,

01:04:51.089 --> 01:04:53.466
the one that doesn't
really descend much at all,

01:04:53.466 --> 01:04:56.188
that's the test set error
on adversarial examples

01:04:56.188 --> 01:04:59.188
if you train on clean examples only.

01:05:00.127 --> 01:05:03.889
The cyan curve that descends
more or less diagonally

01:05:03.889 --> 01:05:05.292
through the middle of the plot,

01:05:05.292 --> 01:05:07.889
that's the test set error on adversarial examples

01:05:07.889 --> 01:05:10.746
if you train on adversarial examples.

01:05:10.746 --> 01:05:13.649
You can see that it does
actually reduce significantly.

01:05:13.649 --> 01:05:16.711
It gets down to a little
bit less than 1% error.

01:05:16.711 --> 01:05:20.012
And the important thing to
keep in mind here is that

01:05:20.012 --> 01:05:23.524
this is fast gradient sign
method adversarial examples.

01:05:23.524 --> 01:05:24.872
It's much harder to resist

01:05:24.872 --> 01:05:27.649
iterative multi-step adversarial examples

01:05:27.649 --> 01:05:29.468
where you run an optimizer for a long time

01:05:29.468 --> 01:05:31.924
searching for a vulnerability.

01:05:31.924 --> 01:05:33.128
And another thing to keep in mind

01:05:33.128 --> 01:05:34.063
is that we're testing on

01:05:34.063 --> 01:05:36.525
the same kind of adversarial
examples that we train on.

01:05:36.525 --> 01:05:37.772
It's harder to generalize

01:05:37.772 --> 01:05:42.141
from one optimization
algorithm to another.

01:05:42.141 --> 01:05:44.558
By comparison, if you look at

01:05:46.881 --> 01:05:48.727
what happens on clean examples,

01:05:48.727 --> 01:05:50.385
the blue curve shows what happens

01:05:50.385 --> 01:05:53.089
on the clean test set error rate

01:05:53.089 --> 01:05:55.687
if you train only on clean examples.

01:05:55.687 --> 01:05:57.249
The red curve shows what happens

01:05:57.249 --> 01:06:01.260
if you train on both clean
and adversarial examples.

01:06:01.260 --> 01:06:02.449
We see that the red curve

01:06:02.449 --> 01:06:04.967
actually drops lower than the blue curve.

01:06:04.967 --> 01:06:07.445
So on this task, training
on adversarial examples

01:06:07.445 --> 01:06:10.188
actually helped us to do
the original task better.

01:06:10.188 --> 01:06:12.625
This is because in the original
task we were overfitting.

01:06:12.625 --> 01:06:15.544
Training on adversarial
examples is a good regularizer.

01:06:15.544 --> 01:06:18.202
If you're overfitting it
can make you overfit less.

01:06:18.202 --> 01:06:21.700
If you're underfitting it'll
just make you underfit worse.

01:06:21.700 --> 01:06:24.562
Other kinds of models
besides deep neural nets

01:06:24.562 --> 01:06:27.287
don't benefit as much
from adversarial training.

01:06:27.287 --> 01:06:29.525
So when we started this
whole topic of study

01:06:29.525 --> 01:06:30.764
we thought that deep neural nets

01:06:30.764 --> 01:06:33.338
might be uniquely vulnerable
to adversarial examples.

01:06:33.338 --> 01:06:35.084
But it turns out that actually

01:06:35.084 --> 01:06:36.625
they're one of the few models that has

01:06:36.625 --> 01:06:38.916
a clear path to resisting them.

01:06:38.916 --> 01:06:40.957
Linear models are just
always going to be linear.

01:06:40.957 --> 01:06:44.204
They don't have much hope of
resisting adversarial examples.

01:06:44.204 --> 01:06:46.423
Deep neural nets can be
trained to be nonlinear,

01:06:46.423 --> 01:06:50.955
and so it seems like there's
a path to a solution for them.

01:06:50.955 --> 01:06:52.261
Even with adversarial training,

01:06:52.261 --> 01:06:55.418
we still find that we aren't able to

01:06:55.418 --> 01:06:57.578
make models where if
you optimize the input

01:06:57.578 --> 01:06:59.063
to belong to different classes,

01:06:59.063 --> 01:07:01.129
you get examples in those classes.

01:07:01.129 --> 01:07:04.844
Here I start with a CIFAR-10
truck and I turn it into

01:07:04.844 --> 01:07:07.935
each of the 10 different CIFAR-10 classes.

01:07:07.935 --> 01:07:09.244
Toward the middle of the plot

01:07:09.244 --> 01:07:10.651
you can see that the truck has started

01:07:10.651 --> 01:07:12.201
to look a little bit like a bird.

01:07:12.201 --> 01:07:13.736
But the bird class is the only one

01:07:13.736 --> 01:07:15.897
that we've come anywhere near hitting.

01:07:15.897 --> 01:07:17.404
So even with adversarial training,

01:07:17.404 --> 01:07:21.876
we're still very far from
solving this problem.

01:07:21.876 --> 01:07:23.180
When we do adversarial training,

01:07:23.180 --> 01:07:25.500
we rely on having labels
for all the examples.

01:07:25.500 --> 01:07:27.340
We have an image that's labeled as a bird.

01:07:27.340 --> 01:07:28.975
We make a perturbation that's designed

01:07:28.975 --> 01:07:30.903
to decrease the probability
of the bird class,

01:07:30.903 --> 01:07:32.161
and we train the model

01:07:32.161 --> 01:07:33.863
that the image should still be a bird.

01:07:33.863 --> 01:07:35.483
But what if you don't have labels?

01:07:35.483 --> 01:07:39.299
It turns out that you can
actually train without labels.

01:07:39.299 --> 01:07:42.700
You ask the model to predict
the label of the original image.

01:07:42.700 --> 01:07:44.298
So if you've trained for a little while

01:07:44.298 --> 01:07:45.697
and your model isn't perfect yet,

01:07:45.697 --> 01:07:47.804
it might say, oh, maybe this
is a bird, maybe it's a plane.

01:07:47.804 --> 01:07:49.324
There's some blue sky there,

01:07:49.324 --> 01:07:51.550
I'm not sure which of
these two classes it is.

01:07:51.550 --> 01:07:53.714
Then we make an adversarial perturbation

01:07:53.714 --> 01:07:55.759
that's intended to change the guess

01:07:55.759 --> 01:07:58.159
and we just try to make it
say, oh this is a truck,

01:07:58.159 --> 01:07:59.357
or something like that.

01:07:59.357 --> 01:08:01.236
Anything that's not whatever
it believed it was before.

01:08:01.236 --> 01:08:02.983
You can then train it to say

01:08:02.983 --> 01:08:04.481
that the distribution over classes

01:08:04.481 --> 01:08:06.557
should still be the same as it was before,

01:08:06.557 --> 01:08:08.343
but this should still be considered

01:08:08.343 --> 01:08:10.600
probably a bird or a plane.

01:08:10.600 --> 01:08:12.752
This technique is called
virtual adversarial training,

01:08:12.752 --> 01:08:15.176
and it was invented by Takeru Miyato.
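
[Editor's note: a rough sketch of the idea. The real method approximates the worst-case direction with power iteration and backprop; here a brute-force finite-difference gradient stands in for it on a tiny made-up softmax model, and all sizes and step sizes are invented. Note that no labels appear anywhere.]

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)

def virtual_adversarial_perturbation(x, W, eps=0.1, xi=1e-2, h=1e-4, iters=2):
    """Find the small perturbation r that most changes the model's own
    predicted distribution p(y|x): power iteration on the curvature of
    KL(p(y|x) || p(y|x+r)), with finite differences as a stand-in for
    backprop."""
    p = softmax(x @ W)
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    for _ in range(iters):
        grad = np.zeros_like(x)
        for j in range(x.shape[-1]):
            e = np.zeros(x.shape[-1])
            e[j] = h
            grad[:, j] = (kl(p, softmax((x + xi * d + e) @ W))
                          - kl(p, softmax((x + xi * d - e) @ W))) / (2 * h)
        d = grad / (np.linalg.norm(grad, axis=-1, keepdims=True) + 1e-12)
    return eps * d

x = rng.normal(size=(5, 4))   # unlabeled inputs
W = rng.normal(size=(4, 3))   # a toy softmax classifier
r = virtual_adversarial_perturbation(x, W)

# The VAT regularizer: train the model so p(y|x+r) stays close to p(y|x).
vat_loss = float(np.mean(kl(softmax(x @ W), softmax((x + r) @ W))))
```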

01:08:15.176 --> 01:08:18.524
He was my intern at Google
after he did this work.

01:08:18.524 --> 01:08:22.720
At Google we invited him to
come and apply his invention

01:08:22.720 --> 01:08:24.637
to text classification,

01:08:25.783 --> 01:08:29.500
because this ability to
learn from unlabeled examples

01:08:29.500 --> 01:08:32.380
makes it possible to do
semi-supervised learning

01:08:32.380 --> 01:08:35.921
where you learn from both
unlabeled and labeled examples.

01:08:35.921 --> 01:08:38.818
And there's quite a lot of
unlabeled text in the world.

01:08:38.818 --> 01:08:41.142
So we were able to bring
down the error rate

01:08:41.142 --> 01:08:43.761
on several different
text classification tasks

01:08:43.761 --> 01:08:47.804
by using this virtual
adversarial training.

01:08:47.804 --> 01:08:49.761
Finally, there's a lot of problems where

01:08:49.761 --> 01:08:52.001
we'd like to use neural nets

01:08:52.001 --> 01:08:54.122
to guide optimization procedures.

01:08:54.122 --> 01:08:57.243
If we want to make a very, very fast car,

01:08:57.243 --> 01:08:59.510
we could imagine a neural net that looks

01:08:59.511 --> 01:09:00.996
at the blueprints for a car

01:09:00.996 --> 01:09:02.743
and predicts how fast it will go.

01:09:02.743 --> 01:09:04.337
If we could then optimize

01:09:04.337 --> 01:09:06.379
with respect to the
input of the neural net

01:09:06.380 --> 01:09:07.600
and find the blueprint

01:09:07.600 --> 01:09:09.303
that it predicts would go the fastest,

01:09:09.303 --> 01:09:11.622
we could build an incredibly fast car.

01:09:11.622 --> 01:09:13.473
Unfortunately, what we get right now

01:09:13.474 --> 01:09:14.975
is not a blueprint for a fast car.

01:09:14.975 --> 01:09:16.959
We get an adversarial
example that the model

01:09:16.959 --> 01:09:18.912
thinks is going to be very fast.
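
[Editor's note: a cartoon of that failure mode. The "speed predictor" is just a fixed linear surrogate I made up, imagined as only ever fit on designs in the unit cube.]

```python
import numpy as np

# A made-up stand-in for a trained speed predictor: sensible car designs
# live in [0, 1]^3, and the surrogate was only ever fit on that region.
w = np.array([0.5, -0.2, 0.8])

def predicted_speed(design):
    return float(w @ design)

# Gradient ascent on the *input* -- the same machinery as an adversarial
# attack, except we maximize the predicted score instead of the loss.
design = np.full(3, 0.5)
for _ in range(100):
    design = design + 0.1 * w   # gradient of w @ design w.r.t. design is w

# The optimizer happily leaves the region where the model knows anything:
# a huge predicted speed attached to a physically meaningless "design".
off_manifold = bool(np.any(design < 0) or np.any(design > 1))
```

The optimizer finds an input the model scores highly rather than a design that is actually fast, which is exactly the adversarial-example problem restated.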

01:09:18.912 --> 01:09:21.758
If we're able to solve the
adversarial example problem,

01:09:21.759 --> 01:09:23.063
we'll be able to solve

01:09:23.063 --> 01:09:25.201
this model-based optimization problem.

01:09:25.201 --> 01:09:27.580
I like to call model-based optimization

01:09:27.580 --> 01:09:29.884
the universal engineering machine.

01:09:29.884 --> 01:09:32.300
If we're able to do
model-based optimization,

01:09:32.300 --> 01:09:34.060
we'll be able to write down
a function that describes

01:09:34.060 --> 01:09:37.540
a thing that doesn't exist
yet but we wish that we had.

01:09:37.540 --> 01:09:39.622
And then gradient descent and neural nets

01:09:39.622 --> 01:09:41.339
will figure out how to build it for us.

01:09:41.340 --> 01:09:44.040
We can use that to design
new genes and new molecules

01:09:44.040 --> 01:09:45.420
for medicinal drugs,

01:09:45.420 --> 01:09:46.753
and new circuits

01:09:48.836 --> 01:09:51.857
to make GPUs run faster
and things like that.

01:09:51.857 --> 01:09:53.697
So I think overall, solving this problem

01:09:53.697 --> 01:09:58.060
could unlock a lot of potential
technological advances.

01:09:58.060 --> 01:10:00.439
In conclusion, attacking
machine learning models

01:10:00.439 --> 01:10:01.660
is extremely easy,

01:10:01.660 --> 01:10:03.886
and defending them is extremely difficult.

01:10:03.886 --> 01:10:06.017
If you use adversarial training

01:10:06.017 --> 01:10:07.841
you can get a little bit of a defense,

01:10:07.841 --> 01:10:09.297
but there's still many caveats

01:10:09.297 --> 01:10:11.079
associated with that defense.

01:10:11.079 --> 01:10:13.500
Adversarial training and
virtual adversarial training

01:10:13.500 --> 01:10:16.240
also make it possible
to regularize your model

01:10:16.240 --> 01:10:18.119
and even learn from unlabeled data

01:10:18.119 --> 01:10:21.031
so you can do better on
regular test examples,

01:10:21.031 --> 01:10:23.841
even if you're not concerned
about facing an adversary.

01:10:23.841 --> 01:10:26.460
And finally, if we're able to
solve all of these problems,

01:10:26.460 --> 01:10:29.757
we'll be able to build a black
box model-based optimization

01:10:29.757 --> 01:10:32.620
system that can solve all
kinds of engineering problems

01:10:32.620 --> 01:10:35.597
that are holding us back
in many different fields.

01:10:35.597 --> 01:10:39.697
I think I have a few
minutes left for questions.

01:10:39.697 --> 01:10:42.697
[audience applauds]

01:10:47.631 --> 01:10:50.798
[speaker drowned out]

01:10:57.256 --> 01:10:58.089
Yeah.

01:11:15.218 --> 01:11:16.051
Oh, so,

01:11:16.973 --> 01:11:18.618
there's some determinism

01:11:18.618 --> 01:11:22.493
to the choice of those 50 directions.

01:11:22.493 --> 01:11:23.496
Oh right, yeah.

01:11:23.496 --> 01:11:24.637
So, repeating the question:

01:11:24.637 --> 01:11:26.261
I've said that the same perturbation

01:11:26.261 --> 01:11:27.676
can fool many different models

01:11:27.676 --> 01:11:29.221
or the same perturbation can be applied

01:11:29.221 --> 01:11:31.599
to many different clean examples.

01:11:31.599 --> 01:11:33.162
I've also said that the subspace

01:11:33.162 --> 01:11:37.141
of adversarial perturbations
is only about 50 dimensional,

01:11:37.141 --> 01:11:40.938
even if the input dimension
is 3,000 dimensional.

01:11:40.938 --> 01:11:43.722
So how is it that these
subspaces intersect?

01:11:43.722 --> 01:11:47.402
The reason is that the choice
of the subspace directions

01:11:47.402 --> 01:11:49.077
is not completely random.

01:11:49.077 --> 01:11:51.595
It's generally going to be something like

01:11:51.595 --> 01:11:55.525
pointing from one class centroid
to another class centroid.

01:11:55.525 --> 01:11:59.692
And if you look at that vector
and visualize it as an image,

01:12:00.565 --> 01:12:03.138
it might not be meaningful to a human

01:12:03.138 --> 01:12:04.362
just because humans aren't very good

01:12:04.362 --> 01:12:06.717
at imagining what class
centroids look like.

01:12:06.717 --> 01:12:07.946
And we're really bad at imagining

01:12:07.946 --> 01:12:10.140
differences between centroids.

01:12:10.140 --> 01:12:12.553
But there is more or less
this systematic effect

01:12:12.553 --> 01:12:14.868
that causes different models to learn

01:12:14.868 --> 01:12:17.000
similar linear functions,

01:12:17.000 --> 01:12:21.167
just because they're trying
to solve the same task.

01:12:22.282 --> 01:12:25.449
[speaker drowned out]

01:12:27.386 --> 01:12:29.359
Yeah, so the question is,
is it possible to identify

01:12:29.359 --> 01:12:33.573
which layer contributes
the most to this issue?

01:12:33.573 --> 01:12:35.656
One thing is that if you,

01:12:36.697 --> 01:12:39.002
the last layer is somewhat important.

01:12:39.002 --> 01:12:42.653
Because, say that you
made a feature extractor

01:12:42.653 --> 01:12:45.263
that's completely robust to
adversarial perturbations

01:12:45.263 --> 01:12:48.783
and can shrink them to
be very, very small,

01:12:48.783 --> 01:12:51.022
and then the last layer is still linear.

01:12:51.022 --> 01:12:53.781
Then it has all the problems
that are typically associated

01:12:53.781 --> 01:12:55.364
with linear models.

01:12:57.667 --> 01:13:00.157
And generally you can
do adversarial training

01:13:00.157 --> 01:13:02.157
where you perturb all
the different layers,

01:13:02.157 --> 01:13:04.042
all the hidden layers
as well as the input.

01:13:04.042 --> 01:13:06.379
In this lecture I only
described perturbing the input

01:13:06.379 --> 01:13:07.653
because it seems like that's where

01:13:07.653 --> 01:13:09.145
most of the benefit comes from.
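That input-perturbation loop can be sketched with the fast gradient sign method on a toy logistic-regression model; the dataset, epsilon, and learning rate here are illustrative assumptions, not settings from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable binary problem: label is the sign of a linear score.
n, dim = 256, 20
true_w = rng.normal(size=dim)
X = rng.normal(size=(n, dim))
y = (X @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(dim)
eps, lr = 0.1, 0.5  # illustrative perturbation size and learning rate

for step in range(200):
    # Gradient of the logistic loss w.r.t. the INPUT: (p - y) * w.
    p = sigmoid(X @ w)
    grad_x = np.outer(p - y, w)
    # Fresh fast-gradient-sign examples at every weight update.
    X_adv = X + eps * np.sign(grad_x)
    # Ordinary weight update, but computed on the adversarial batch.
    p_adv = sigmoid(X_adv @ w)
    w -= lr * (X_adv.T @ (p_adv - y)) / n

clean_acc = ((sigmoid(X @ w) > 0.5) == (y > 0.5)).mean()
```

The key point is that the adversarial batch is regenerated inside the loop, so the examples track the current weights rather than being a fixed augmented dataset.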

01:13:09.145 --> 01:13:11.445
The one thing that you can't
do with adversarial training

01:13:11.445 --> 01:13:14.279
is perturb the very last
layer before the softmax,

01:13:14.279 --> 01:13:15.946
because that linear layer at the end

01:13:15.946 --> 01:13:18.661
has no way of learning to
resist the perturbations.

01:13:18.661 --> 01:13:20.740
Doing adversarial training at that layer

01:13:20.740 --> 01:13:23.410
usually just breaks the whole process.

01:13:23.410 --> 01:13:27.896
But other than that, it
seems very problem dependent.

01:13:27.896 --> 01:13:30.741
There's a paper by Sara
Sabour and her collaborators

01:13:30.741 --> 01:13:34.238
called Adversarial Manipulation
of Deep Representations,

01:13:34.238 --> 01:13:36.536
where they design adversarial examples

01:13:36.536 --> 01:13:41.439
that are intended to fool
different layers of the net.

01:13:41.439 --> 01:13:43.225
They report some things about, like,

01:13:43.225 --> 01:13:45.418
how large of a perturbation
is needed at the input

01:13:45.418 --> 01:13:47.338
to get different sizes of perturbation

01:13:47.338 --> 01:13:49.061
at different hidden layers.

01:13:49.061 --> 01:13:50.858
I suspect that if you trained the model

01:13:50.858 --> 01:13:52.616
to resist perturbations at one layer,

01:13:52.616 --> 01:13:54.315
then another layer would
become more vulnerable

01:13:54.315 --> 01:13:57.398
and it would be like a moving target.

01:14:00.901 --> 01:14:04.068
[speaker drowned out]

01:14:09.775 --> 01:14:10.778
Yes, so the question is,

01:14:10.778 --> 01:14:12.197
how many adversarial examples are needed

01:14:12.197 --> 01:14:15.797
to improve the misclassification rate?

01:14:15.797 --> 01:14:20.200
In some of our plots we
include learning curves.

01:14:20.200 --> 01:14:22.157
Or in some of our papers we
include learning curves,

01:14:22.157 --> 01:14:24.157
so you can actually see,

01:14:25.138 --> 01:14:26.602
like in this one here.

01:14:26.602 --> 01:14:29.874
Every time we do an epoch
we generate the same

01:14:29.874 --> 01:14:31.503
number of adversarial examples

01:14:31.503 --> 01:14:33.525
as there are training examples.

01:14:33.525 --> 01:14:37.701
So every epoch here is
50,000 adversarial examples.

01:14:37.701 --> 01:14:41.056
You can see that adversarial
training is a very

01:14:41.056 --> 01:14:43.381
data hungry process.

01:14:43.381 --> 01:14:45.861
You need to make new adversarial examples

01:14:45.861 --> 01:14:47.781
every time you update the weights.

01:14:47.781 --> 01:14:51.112
And they're constantly
changing in reaction to

01:14:51.112 --> 01:14:54.862
whatever the model has
learned most recently.

01:14:55.861 --> 01:14:59.028
[speaker drowned out]

01:15:07.264 --> 01:15:10.514
Oh, the model-based optimization, yeah.

01:15:11.837 --> 01:15:13.853
Yeah, so the question is just to

01:15:13.853 --> 01:15:16.277
elaborate further on this problem.

01:15:16.277 --> 01:15:20.341
So most of the time that we
have a machine learning model,

01:15:20.341 --> 01:15:23.701
it's something like a
classifier or a regression model

01:15:23.701 --> 01:15:26.741
where we give it an
input from the test set

01:15:26.741 --> 01:15:29.040
and it gives us an output.

01:15:29.040 --> 01:15:31.474
And usually that input
is randomly occurring

01:15:31.474 --> 01:15:34.981
and comes from the same
distribution as the training set.

01:15:34.981 --> 01:15:37.178
We usually just run the
model, get its prediction,

01:15:37.178 --> 01:15:39.435
and then we're done with it.

01:15:39.435 --> 01:15:42.019
Sometimes we have feedback loops,

01:15:42.019 --> 01:15:44.297
like for recommender systems.

01:15:44.297 --> 01:15:47.547
If you work at Netflix and you recommend

01:15:47.547 --> 01:15:50.707
a movie to a viewer,
then they're more likely

01:15:50.707 --> 01:15:52.757
to watch that movie and then rate it,

01:15:52.757 --> 01:15:54.661
and then there's going
to be more ratings of it

01:15:54.661 --> 01:15:55.658
in your training set

01:15:55.658 --> 01:15:57.440
so you'll recommend it to
more people in the future.

01:15:57.440 --> 01:15:58.661
So there's this feedback loop

01:15:58.661 --> 01:16:00.936
from the output of your
model to the input.

01:16:00.936 --> 01:16:04.677
Most of the time when we
build machine vision systems,

01:16:04.677 --> 01:16:08.522
there's no feedback loop from
their output to their input.

01:16:08.522 --> 01:16:09.541
If we imagine a setting

01:16:09.541 --> 01:16:11.440
where we start using an
optimization algorithm

01:16:11.440 --> 01:16:15.607
to find inputs that maximize
some property of the output,

01:16:17.298 --> 01:16:18.842
like if we have a model that looks

01:16:18.842 --> 01:16:20.602
at the blueprints of a car

01:16:20.602 --> 01:16:24.122
and outputs the expected speed of the car,

01:16:24.122 --> 01:16:27.498
then we could use gradient ascent

01:16:27.498 --> 01:16:29.578
to look for the blueprints that correspond

01:16:29.578 --> 01:16:31.895
to the fastest possible car.

01:16:31.895 --> 01:16:33.674
Or for example if we're
designing a medicine,

01:16:33.674 --> 01:16:36.618
we could look for the molecular structure

01:16:36.618 --> 01:16:40.842
that we think is most likely
to cure some form of cancer,

01:16:40.842 --> 01:16:42.720
or the least likely to cause

01:16:42.720 --> 01:16:45.976
some kind of liver toxicity effect.

01:16:45.976 --> 01:16:49.162
The problem is that once
we start using optimization

01:16:49.162 --> 01:16:50.720
to look for these inputs

01:16:50.720 --> 01:16:53.061
that maximize the output of the model,

01:16:53.061 --> 01:16:56.761
the input is no longer
an independent sample

01:16:56.761 --> 01:16:58.202
from the same distribution

01:16:58.202 --> 01:17:00.557
as we used at training time.

01:17:00.557 --> 01:17:04.202
The model is now guiding the process

01:17:04.202 --> 01:17:06.218
that generates the data.

01:17:06.218 --> 01:17:10.385
So we end up finding essentially
adversarial examples.

01:17:11.246 --> 01:17:13.104
Instead of the model telling us

01:17:13.104 --> 01:17:15.242
how we can improve the input,

01:17:15.242 --> 01:17:16.901
what we usually find in practice

01:17:16.901 --> 01:17:19.720
is that we've got an
input that fools the model

01:17:19.720 --> 01:17:23.141
into thinking that the input
corresponds to something great.

01:17:23.141 --> 01:17:26.282
So we'd find molecules that are very toxic

01:17:26.282 --> 01:17:28.901
but the model thinks
they're very non-toxic.

01:17:28.901 --> 01:17:30.464
Or we'd find cars that are very slow

01:17:30.464 --> 01:17:33.381
but the model thinks are very fast.
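A toy version of that failure mode (the "car speed" model and every number here are hypothetical): fit a model on a narrow range of designs, then run gradient ascent on the input, and the optimizer walks far outside the training distribution to a design the model rates as excellent but that is actually terrible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: "speed" peaks at x = 1 and then falls off.
def true_speed(x):
    return 2.0 * x - x ** 2

# The model only ever sees blueprints near x = 0.
x_train = rng.uniform(-0.5, 0.5, 200)
a, b = np.polyfit(x_train, true_speed(x_train), 1)  # simple linear fit

# Gradient ascent on the INPUT to maximize the model's predicted speed.
x = 0.0
for _ in range(100):
    x += 0.1 * a  # d/dx of a linear model is just its slope

predicted = a * x + b   # the model thinks this design is very fast
actual = true_speed(x)  # the true speed out here is far worse
```

The optimizer never queries reality, only the model, so it ends up in a region where the model's extrapolation is confidently wrong.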

01:17:35.621 --> 01:17:38.788
[speaker drowned out]

01:17:54.678 --> 01:17:56.017
Yeah, so the question is,

01:17:56.017 --> 01:17:58.859
here the frog class is boosted by going

01:17:58.859 --> 01:18:01.936
in either the positive or
negative adversarial direction.

01:18:01.936 --> 01:18:06.276
And in some of the other
slides, like these maps,

01:18:06.276 --> 01:18:09.217
you don't get that effect
where subtracting epsilon off

01:18:09.217 --> 01:18:12.097
eventually boosts the adversarial class.

01:18:12.097 --> 01:18:13.819
Part of what's going on is

01:18:13.819 --> 01:18:16.496
I think I'm using a larger epsilon here.

01:18:16.496 --> 01:18:18.135
And so you might
eventually see that effect

01:18:18.135 --> 01:18:20.038
if I'd made these maps wider.

01:18:20.038 --> 01:18:21.627
I made the maps narrower because

01:18:21.627 --> 01:18:25.034
it's like quadratic time to build a 2D map

01:18:25.034 --> 01:18:29.639
and it's linear time to
build a 1D cross section.

01:18:29.639 --> 01:18:33.197
So I just couldn't afford the GPU time

01:18:33.197 --> 01:18:35.278
to make the maps quite as wide.

01:18:35.278 --> 01:18:37.009
I also think that this might just be

01:18:37.009 --> 01:18:39.999
a weird effect that happened
randomly on this one example.

01:18:39.999 --> 01:18:42.742
It's not something that I'm
used to seeing

01:18:42.742 --> 01:18:43.878
a lot of the time.

01:18:43.878 --> 01:18:45.441
Most things that I observe

01:18:45.441 --> 01:18:47.495
don't happen perfectly consistently.

01:18:47.495 --> 01:18:50.582
But if they happen, like, 80% of the time

01:18:50.582 --> 01:18:52.598
then I'll put them in my slide.

01:18:52.598 --> 01:18:54.823
A lot of what we're doing is
trying to figure out

01:18:54.823 --> 01:18:56.118
more or less what's going on,

01:18:56.118 --> 01:18:58.641
and so if we find that something
happens 80% of the time,

01:18:58.641 --> 01:19:02.198
then I consider it to be
the dominant phenomenon

01:19:02.198 --> 01:19:03.934
that we're trying to explain.

01:19:03.934 --> 01:19:06.102
And after we've got a
better explanation for that

01:19:06.102 --> 01:19:07.739
then I might start to try to explain

01:19:07.739 --> 01:19:09.276
some of the weirder things that happen,

01:19:09.276 --> 01:19:13.109
like the frog happening
with negative epsilon.

01:19:15.415 --> 01:19:18.582
[speaker drowned out]

01:19:22.436 --> 01:19:24.062
I didn't fully understand the question.

01:19:24.062 --> 01:19:28.145
It's about the dimensionality
of the adversarial subspace?

01:19:34.484 --> 01:19:35.801
Oh, okay.

01:19:35.801 --> 01:19:37.504
So the question is, how is the dimension

01:19:37.504 --> 01:19:39.243
of the adversarial subspace related

01:19:39.243 --> 01:19:40.827
to the dimension of the input?

01:19:40.827 --> 01:19:44.078
And my answer is somewhat embarrassing,

01:19:44.078 --> 01:19:47.042
which is that we've only run
this method on two datasets,

01:19:47.042 --> 01:19:49.926
so we actually don't have a good idea yet.

01:19:49.926 --> 01:19:53.526
But I think it's something
interesting to study.
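As a toy sketch of the kind of measurement involved (this is not the actual method from the open-sourced code; the construction of the candidate directions here is a simplifying assumption): take a stand-in loss gradient, build mutually orthogonal directions aligned with it, and count how many give a first-order loss increase above a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 784  # e.g. a flattened 28x28 image; illustrative
g = rng.normal(size=dim)  # stand-in for the loss gradient at one input

def count_adversarial_dims(g, k, eps=1.0, thresh=0.1):
    # k random orthonormal directions via QR of a Gaussian matrix.
    basis, _ = np.linalg.qr(rng.normal(size=(g.size, k)))
    # Flip each direction so it points uphill in loss.
    basis = basis * np.sign(basis.T @ g)
    # First-order loss increase of a step of size eps along each one.
    gains = eps * (basis.T @ g)
    return int((gains > thresh).sum())
```

Repeating this for growing `k` until the count stops increasing gives a rough estimate of how many independent adversarial directions exist at that input.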

01:19:53.526 --> 01:19:57.104
If I remember correctly, my
coauthors open sourced our code.

01:19:57.104 --> 01:19:59.323
So you could probably run it on ImageNet

01:19:59.323 --> 01:20:01.406
without too much trouble.

01:20:02.261 --> 01:20:04.150
My contribution to that paper was in

01:20:04.150 --> 01:20:06.066
the week that I was unemployed

01:20:06.066 --> 01:20:09.417
between working at OpenAI
and working at Google,

01:20:09.417 --> 01:20:11.030
so I had access to no GPUs

01:20:11.030 --> 01:20:14.288
and I ran that experiment
on my laptop on CPU,

01:20:14.288 --> 01:20:18.455
so it's only really small
datasets. [chuckles]

01:20:19.766 --> 01:20:22.933
[speaker drowned out]

01:20:40.233 --> 01:20:44.248
Oh, so the question is,
do we end up perturbing

01:20:44.248 --> 01:20:47.695
clean examples to low
confidence adversarial examples?

01:20:47.695 --> 01:20:50.633
Yeah, in practice we usually find that

01:20:50.633 --> 01:20:53.843
we can get very high confidence
on the output examples.

01:20:53.843 --> 01:20:57.156
One thing in high dimensions
that's a little bit unintuitive

01:20:57.156 --> 01:21:00.313
is that just getting the sign right

01:21:00.313 --> 01:21:03.353
on very many of the input pixels

01:21:03.353 --> 01:21:06.516
is enough to get a really strong response.

01:21:06.516 --> 01:21:09.845
So the angle between the weight vector and the perturbation

01:21:09.845 --> 01:21:13.492
matters a lot more than
the exact coordinates

01:21:13.492 --> 01:21:15.825
in high dimensional systems.
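A quick numeric check of that point: a perturbation that only matches the sign of each weight, with every coordinate bounded by a small epsilon, yields a response of epsilon times the L1 norm of the weights, which grows linearly with dimension. All numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # tiny bound on each coordinate of the perturbation

responses = {}
for dim in (10, 1000, 100000):
    w = rng.normal(size=dim)      # weights of a hypothetical linear model
    perturb = eps * np.sign(w)    # only the SIGN of each weight is right
    responses[dim] = float(w @ perturb)  # equals eps * ||w||_1
```

No coordinate moves by more than `eps`, yet the response scales with the number of input dimensions, which is why getting just the signs right is enough in high dimensions.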

01:21:18.255 --> 01:21:20.087
Does that make enough sense?

01:21:20.087 --> 01:21:21.004
Yeah, okay.

01:21:21.868 --> 01:21:23.673
- [Man] So we're actually
going to [mumbles].

01:21:23.673 --> 01:21:26.095
So if you guys need to leave, that's fine.

01:21:26.095 --> 01:21:28.175
But let's thank our speaker one more time

01:21:28.175 --> 01:21:30.175
for getting--
[audience applauds]